ABSTRACT


This thesis implements a cyclostationary estimation technique, the time-smoothing FFT
accumulation method, on a reconfigurable computer to generate a frequency vs. cycle
frequency approximation of the input signal. This signal processing method can be used to
identify signal modulation type and extract the parameters of low probability of intercept
signals in electronic intelligence discrimination receivers. The implementation builds on
previous work at the Naval Postgraduate School and focuses on reducing the overall runtime
to approach real-time processing by utilizing both field programmable gate arrays (FPGAs)
within a single multi-adaptive processor (MAP). Hardware decisions are made by analyzing
the relationships between frequency resolution, Grenander’s Uncertainty Condition, and the
desired cycle frequency resolution. Implemented on the SRC-6 reconfigurable computer
utilizing Xilinx Virtex-II FPGAs, this work takes advantage of the techniques for which the
SRC-6 is optimized, such as pipelining, array processing, and memory access techniques.
EXECUTIVE SUMMARY


Advanced missiles and radars are employing low probability of intercept (LPI) techniques
to appear nearly invisible to adversary intercept receiver systems. An autonomous system
that analyzes LPI radar signals is under development at the Naval Postgraduate School.
Repeatable, accurate, and timely classification benefits electronic intelligence platforms,
ships, airplanes, or any military combat system. The basic design of this autonomous system
comprises a collection of advanced algorithms that process a signal and pass the signal
modulation and the extracted signal parameters to a decision module that classifies the
signal. The algorithms currently under consideration for this system are:

• Choi-Williams Signal Processing;
• Quadrature Mirror Filtering;
• Cyclostationary Signal Processing.


This thesis is a continued investigation into the cyclostationary signal processing
algorithm as described in [1] and implemented in [2]. The time-smoothing FFT accumulation
method (FAM) was previously implemented on a single field programmable gate array (FPGA)
of a single multi-adaptive processor (MAP) onboard the SRC-6 [2]. This thesis shows that
pipelining, array processing, and memory access techniques significantly reduce the overall
runtime and so improve the usability of the cyclostationary FAM algorithm implemented in
[2]. Consequently, the iteration runtime approaches a value that allows the cyclostationary
FAM technique to be used in a real-time electronics discrimination receiver. To achieve
better performance, both FPGAs on a single MAP are used and the data pipeline is optimized.
A reduction in memory dependencies and the development of a memory management plan are the
two fundamental techniques employed to reduce the runtime to roughly one-tenth of that of
the original FAM implementation in [2]. These techniques are not specific to the SRC-6 or
Virtex FPGAs; they can be applied on any FPGA.




Two types of LPI radar signals were used to investigate the algorithm performance: Frank
code phase modulation and frequency modulated continuous waveform (FMCW) signals, both
generated in MATLAB using code from [1]. Exploring these signals guided the decisions made
when implementing the FAM in hardware.


The FAM algorithm consists of first windowing the data and performing an N'-point FFT. The
output data is split and modulated before being multiplied together. A second, P-point FFT
is used to get the output spectral correlation density function.


When the sampling period is fixed by the selection of the analog-to-digital conversion
hardware, the frequency and cycle-frequency resolutions are set by the chosen value of
Grenander’s Uncertainty Condition. This also sets the FFT sizes that are used. A product of
the FFT sizes of 1024 or greater yields sufficient resolution along the frequency and
cycle-frequency axes to give meaningful results. This decision tool helps the designer of a
cyclostationary signal processing FAM algorithm use the available resources to provide
maximum flexibility when designing intercept receivers.


The FAM algorithm is used to identify the LPI radar modulations under nominal
signal-to-noise ratios (0 to -6 dB). In addition, the extraction of the signal parameters is
important for emitter classification. Careful examination of LPI signals forces choices in
hardware. Smart implementation of pipelines, memory management plans, and array utilization
on the SRC-6 continues to improve the usefulness of the cyclostationary FAM algorithm.
I. INTRODUCTION


A. LOW PROBABILITY OF INTERCEPT RADARS 


Low probability of intercept (LPI) radars are a useful tool in the art of modern warfare.
Military planners find them useful and despise their use by the opposition. By design, LPI
radars are difficult to detect, either because of their low transmitted energy levels or
because they use advanced signal modulation techniques. The threat and the capabilities of
these LPI radars are justification enough to develop techniques to detect and analyze them.
Countries that develop capabilities that successfully discriminate between LPI waveforms
have the upper hand in the theater of electronic intelligence.


Since the early 1980s, engineers and mathematicians have been working to implement methods
that successfully discriminate between LPI waveforms. Several mathematical techniques allow
an operator to successfully classify signals. Each algorithm is good for certain signal
types; no one algorithm is good for all types of signal modulation and for extraction of
all signal parameters. A system that covers this task completely is therefore needed.
Professor Phillip Pace and his colleagues have been working to develop, model, and test
such a system at the Naval Postgraduate School in Monterey, California. A block diagram of
the proposed system is shown in Figure 1. This thesis is a continued investigation into the
cyclostationary signal-processing block, which is part of the larger project to create an
autonomous system for classification of non-cooperative emitters. For an in-depth
discussion of the other blocks, see [1]. Ultimately, the goal for the cyclostationary block
is to extract accurate and reliable information about a signal that can be compared with
the output of the other blocks to make autonomous decisions about a signal of interest.


[Figure 1 block diagram. Blocks: Digital Receiver; Choi-Williams Signal Processing,
Quadrature Mirror Filtering, and Cyclostationary Signal Processing, each followed by Image
Analysis and Non-Linear Processing; Non-linear Processing and Modulation Decision;
Parameter Extraction; Decision Processing Classification. Output: modulation, bandwidth,
frequency, etc.]


Figure 1. Autonomous LPI System (From: [1])


B. OBJECTIVE 


The objective of this thesis is to develop a functional program that implements the
cyclostationary time-smoothing fast Fourier transform accumulation method (FAM) algorithm
found in [1], starting from the previously developed single field programmable gate array
(FPGA) implementation in [2]. Two adjacent FPGAs will be utilized, with the major benefit
being twice the amount of available logic. The ultimate goal of this project is to reduce
the overall runtime of the program to approach near-real-time operation. A secondary goal
is to extract data at sufficient resolutions to provide meaningful results.


As a result of analyzing LPI signals, certain decisions are reached that help determine
specifications for hardware and software. These decisions are the result of developing an
understanding of the interrelationship between the hardware and the signals analyzed.
Future designers can make smart software choices based upon the selection of data
acquisition hardware and a relationship called Grenander’s Uncertainty Condition.


The cyclostationary block of Figure 1 has several tasks that must be accomplished for a
successful project. First, digesting the intricacies of the FAM algorithm is key to
developing an efficient implementation on any platform. Second, the algorithm must be
tailored for the target platform. This is accomplished by taking advantage of functions
that the target platform does well and employing known techniques for maximum efficiency.
The target platform for this thesis is the SRC-6 reconfigurable computer. In particular,
the target is the E-Series MAP, which has two user-programmable Xilinx Virtex-II FPGAs
onboard.


C. RELATED WORK


As part of a larger project, this thesis continues and expands upon the work in [2].
Upperman’s single-FPGA implementation of the cyclostationary algorithm has several space
and speed limitations that restrict the usable data because of inadequate resolution in
both frequency and cycle-frequency. To summarize Upperman’s work, he successfully
implemented four versions of the FAM algorithm. The C and MATLAB models performed well and
are the performance target for this thesis. The SRC implementation is fast, but it is not
as fast as the C and MATLAB models, which have “unlimited memory” and a processor in the
gigahertz range. Table 1 is the timing summary of the single-chip custom FFT
implementation. The resulting frequency and cycle-frequency resolutions are not sufficient
to determine anything more than the center frequency and an approximation of the bandwidth
of the signal of interest. The code rate (R_c) is not discovered because the frequency and
cycle-frequency resolutions are insufficient.


Table 1. Single Chip Timing Results 


Execution times:
2.170836 seconds total
Of the total time:
    seconds were spent on the CALLS to FFTs
    0. seconds were spent on the MAP for FFTs
    Of the time spent on the MAP for FFTs:
        0.382998 seconds were spent on DMAs
        0.075691 seconds were spent in the FFT loop
    0.116433 seconds were spent on the CALLS to Channelize
    0.000157 seconds were spent on the MAP for Channelize
    Of the time spent on the MAP for Channelize:
        0.000024 seconds were spent on DMAs
        0.000133 seconds were spent channelizing
    0.106429 seconds were spent on the CALLS to Downconvert
    0.003294 seconds were spent on the MAP for Downconvert
    Of the time spent on the MAP for Downconvert:
        0.000730 seconds were spent on DMAs
        0.002564 seconds were spent downconverting
Execution time not including calls and data transfers: 0.250462 seconds
Results from Cyclostationary FAM algorithm written to:
FAM result.txt
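
One way to read Table 1 is to compare, for each stage, the time spent in the call with the time actually spent on the MAP. The short sketch below does this for the two stages whose values are fully legible. It assumes that the "CALLS" time includes the time spent on the MAP; the function name `overhead_fraction` is a hypothetical helper introduced only for this illustration.

```python
# Timing values in seconds, copied from Table 1. The FFT stage is left out
# because its call and MAP totals are not legible in the table.
STAGES = {
    "Channelize": {"call": 0.116433, "map": 0.000157, "dma": 0.000024, "compute": 0.000133},
    "Downconvert": {"call": 0.106429, "map": 0.003294, "dma": 0.000730, "compute": 0.002564},
}


def overhead_fraction(call_s, map_s):
    """Fraction of a call's time that is not spent on the MAP.

    Assumes the call time includes the MAP time (an assumption, not stated
    in the table).
    """
    return (call_s - map_s) / call_s


for name, t in STAGES.items():
    # Sanity check: DMA time plus compute time should equal the MAP time.
    assert abs(t["dma"] + t["compute"] - t["map"]) < 1e-6
    frac = overhead_fraction(t["call"], t["map"])
    print(f"{name}: {frac:.1%} of the call time is spent off the MAP")
```

Under that assumption, both stages spend most of their call time outside the MAP, which is consistent with the emphasis this thesis places on reducing data movement and call overhead.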






























This thesis covers only a small block of the proposed system in Figure 1. Other theses
cover the other algorithms [3], [4], [5]. After purchasing the SRC-6 reconfigurable
computer, the Naval Postgraduate School began testing the suitability of this architecture
for developing algorithms to discriminate LPI signals [6], [7]. To date, all work has shown
that the SRC-6 and newer variants would be suitable for the advanced signal processing
needed to develop the autonomous system described by Figure 1. Upperman and Macklin agree
that the SRC-6 is a difficult system to master but has great potential for signal
processing algorithms.


D. THESIS ORGANIZATION 


This thesis is organized by chapters that develop the project from concept to code and is
laid out as follows:

• Chapter II provides information on low probability of intercept systems,
  cyclostationary spectral analysis, and Grenander’s Uncertainty Condition.
• Chapter III provides information on SRC-6 reconfigurable computing, including both
  hardware and software considerations.
• Chapter IV provides information on the SRC-6 specific techniques, including memory
  implementation, loops and dependencies, arrays, parallel regions, and using two logic
  chips.
• Chapter V provides information on the Dual Chip flow-through design.
• Chapter VI provides the timing analysis.
• Chapter VII provides information on achieving usable data and future direction.
• Chapter VIII provides a conclusion for this thesis.
• Appendix A is a datasheet that shows the subtle differences between MAPs that are
  compatible with the SRC-6.
• Appendices B-F provide the Dual Chip Flow Through Design code. The code is broken up
  into separate appendices to act as markers.
II. LOW PROBABILITY OF DETECTION SYSTEMS


An LPI radar is defined as a radar that uses a special emitted waveform intended to prevent
a non-cooperative intercept receiver from intercepting and detecting its emission [1]. LPI
is also described as a property of a radar that, because of its low power, wide bandwidth,
and frequency variability, makes it difficult for passive intercept receivers to identify
[1]. The threat of LPI radar is real and growing.


For example, the Vostok-E, designed to replace the popular P-18 radar, is a new mobile
radar that utilizes several cutting-edge technologies that define it as an LPI radar.
Capable of using low-power, noise-like probing signals, it provides reliable protection
against anti-radiation missiles (ARM) [8]. Designed in Belarus, this solid-state 2D digital
VHF radar detects air contacts and measures their range, azimuth, and range rate. Targets
are tracked in two-dimensional space, automatically classified, and integrated into
networked command and control equipment. Complete with an integrated diesel generator, this
unit is an asset to any anti-air campaign. What makes this radar so dangerous is that its
emissions are hard to detect. In addition, it boasts enhanced jamming immunity because it
employs wide dynamic range radio receiving devices and pulse-by-pulse automatic carrier
frequency tuning [8]. Figure 2 is a picture of the Vostok-E in combat mode.





Figure 2. “Vostok-E” Mobile Solid-State 2D Digital VHF Radar (From: [8]) 


A subsonic LPI missile is currently under development by Saab Bofors Dynamics. The next
generation RBS-15 will be an LPI version of the successful RBS-15 Mk3. To make this missile
LPI, the design engineers are designing the seeker using frequency-modulated continuous
wave spread-spectrum technology [9]. The signals used by this radar are similar to the type
analyzed in Chapter II Section B. Along with LPI radar, an imaging IR seeker will be used.
Another advanced feature considered for integration into this missile is a two-way data
link for updating targeting data [9]. See Figure 3 for a diagram of the RBS-15.


Figure 3. RBS-15 Mk3 Missile (From: [9])


A. CYCLOSTATIONARY SPECTRAL ANALYSIS (CSA)


1. Cyclostationary Definition


Cyclostationary spectral analysis (CSA) is based on modeling the signal as a
cyclostationary process rather than a stationary process. A signal is cyclostationary of
order n if and only if one can find some nth-order nonlinear transformation of the signal
that will generate finite-strength additive sine wave components that result in spectral
lines [1].


CSA is a valuable tool in LPI analysis because it shows the user the modulation type and
the parameters for many LPI signal types. Bandwidth (B), center frequency (f_c), code rate
(R_c), and modulation period (t_m) are four parameters of the signal of interest that can
be extracted. Another benefit of CSA is that it performs well on signals with added noise.
This thesis will show in a later chapter that signals with signal-to-noise ratios as low as
-6 dB can be analyzed and all parameters can still be extracted. CSA performs well on the
following types of signals:

• Binary Phase Shift Keying (BPSK);
• Frequency Modulation Continuous Waveform (FMCW);
• Frank Code (Polyphase Code).


This thesis will show, by example, how the operator using CSA processing can extract
parameters for FMCW and Frank code signals. One drawback of the CSA is that it is not able
to extract any time-dependence information. Time information is lost during the
transformation into the frequency vs. cycle-frequency domain. For example, the CSA cannot
determine the order of frequencies in a frequency-hopping signal. This loss of time
information justifies the need for a secondary set of time-frequency algorithms such as
Wigner-Ville, Choi-Williams, or Quadrature Mirror Filter Bank techniques, as indicated in
Figure 1.


2. Time-Smoothing FFT Accumulation Method (FAM)


One efficient method of implementing a hardware computation of the cyclostationary
spectrum is the time-smoothing FFT accumulation method. Equation (1.1) from [1] is the
mathematical representation of Figure 4. The CSA can be written as

\[
S_{X_{N'}}^{\gamma}(n,k) = \frac{1}{N}\sum_{n=0}^{N-1}\left[\frac{1}{N'}\,
X_{N'}\!\left(n,k+\frac{\gamma}{2}\right)X_{N'}^{*}\!\left(n,k-\frac{\gamma}{2}\right)\right]
\tag{1.1}
\]

where

\[
X_{N'}(n,k) = \sum_{n=0}^{N'-1} w(n)\,x(n)\,e^{-j2\pi kn/N'}
\]

k is the frequency (discrete);
N is the total number of discrete samples in the observation;
N' is the number of points in the discrete (sliding) FFT;
w(n) is the windowing function (a Hamming window is used);
x(n) is the sampled complex-valued signal;
X*_{N'} is the complex conjugate of X_{N'};
γ is the cycle frequency (discrete).


Note that this form shows the CSA as a spectral correlation.


3. Data Path 


The time-smoothing FFT accumulation method, shown in Figure 4, was implemented in [2] on
the SRC-6 and is the starting point for this thesis. The data path as described in [1] has
three stages:

• Computation of complex demodulates:
    o Data tapering (Hamming window)
    o Sliding N'-point Fourier transform
    o Baseband frequency translation
• Computation of product sequences;
• Smoothing of product sequences (P-point FFT).
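
The three stages of the data path can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from [1] or [2]: the function names (`dft`, `hamming`, `fam`), the naive O(n^2) DFT, the hop size argument, and the 1/P scaling are all choices made only for this example.

```python
import cmath
import math


def dft(seq):
    """Naive O(n^2) DFT; adequate for a small illustration."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def hamming(n):
    """Hamming window of length n (data tapering)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fam(x, n_prime, hop):
    """Return {(k, l): [magnitudes]} for every pair of channels k and l.

    Stage 1: windowed sliding n_prime-point DFT with baseband translation.
    Stage 2: product of channel k with the conjugate of channel l.
    Stage 3: P-point DFT along the block index (smoothing), where P is the
             number of sliding blocks.
    """
    window = hamming(n_prime)
    demods = []
    for start in range(0, len(x) - n_prime + 1, hop):
        block = [window[i] * x[start + i] for i in range(n_prime)]
        spectrum = dft(block)
        # Baseband translation: undo the phase advance caused by sliding
        # the window forward by `start` samples.
        demods.append([spectrum[k] * cmath.exp(-2j * math.pi * k * start / n_prime)
                       for k in range(n_prime)])
    p = len(demods)
    scd = {}
    for k in range(n_prime):
        for l in range(n_prime):
            product = [demods[m][k] * demods[m][l].conjugate() for m in range(p)]
            scd[(k, l)] = [abs(v) / p for v in dft(product)]  # illustrative scaling
    return scd


if __name__ == "__main__":
    # A complex tone centered on channel 2 of an 8-point sliding DFT.
    tone = [cmath.exp(2j * math.pi * 2 * n / 8) for n in range(32)]
    result = fam(tone, n_prime=8, hop=2)
    print([round(v, 3) for v in result[(2, 2)]])
```

In the usual FAM formulation, the channel pair (k, l) together with the smoothing bin selects one point in the frequency vs. cycle-frequency plane. For the pure tone in the usage example, the demodulate in channel 2 is constant from block to block, so the smoothed product for the pair (2, 2) concentrates in its first bin.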


[Figure 4 block diagram: input x(n), window, N'-point FFT, e^{-j2πkn} frequency
translation, second (P-point) FFT.]


Figure 4. Data Path (From: [1])


B. GRENANDER’S UNCERTAINTY CONDITION 


Grenander’s Uncertainty Condition (M) is a relationship that, when followed, keeps the
ratio of frequency resolution to cycle-frequency resolution large enough to give the best
output. Equations (1.2) (continuous time) and (1.3) (discrete variables) show that
Grenander’s Uncertainty Condition holds as long as the ratio Δf/Δα is much larger than one.
Table 2 shows the linear effect of a chosen M for a given Δf. Since Δf is normally chosen
by the operator as a function of the data acquisition process, the only remaining variable
to choose is M. The result is the corresponding Δα. This decision tool can be implemented
in the software that runs a CSA.

\[ M = \frac{\Delta f}{\Delta\alpha} \gg 1 \tag{1.2} \]

\[ M = \frac{N}{N'} \gg 1 \tag{1.3} \]


Table 2. Cycle Frequency Resolution for different M based on Δf.
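
The relationship behind Table 2 follows directly from Equation (1.2): once Δf is fixed, each choice of M gives Δα = Δf/M. The sketch below tabulates this for a few combinations. The specific Δf and M values are illustrative choices and are not taken from Table 2, and `cycle_resolution` is a hypothetical helper name.

```python
def cycle_resolution(delta_f_hz, m):
    """Cycle-frequency resolution (Hz) implied by M = delta_f / delta_alpha."""
    if m <= 1:
        raise ValueError("Grenander's Uncertainty Condition requires M > 1")
    return delta_f_hz / m


# Illustrative grid: rows are frequency resolutions, columns are values of M.
for delta_f in (128, 64, 32, 16):
    row = ", ".join(f"M={m}: {cycle_resolution(delta_f, m):g} Hz" for m in (2, 4, 8, 16))
    print(f"df = {delta_f} Hz -> {row}")
```

For a fixed Δf, doubling M halves Δα, which is the linear effect described in the text above.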
1. Effects of M on FMCW and Frank Signals


To help show how a chosen M affects the ability to extract important information from the
signal of interest, experiments were conducted using MATLAB software from [1]. Two signals
were examined without noise to show how a given M affects the detectability of the desired
parameters. This experiment was repeated for several frequency resolutions. Studying
frequency vs. cycle frequency plots, one learns that there are usually four main masses on
any given plot, approximately one on each of the main axes (0°, 90°, 180°, 270°). See
Figure 5 for a general example. For clarity and consistency, each of the plots (Figures 5
through 29, excluding Figure 18) represents a close-up view of the mass at 0°. The 90° view
(or 270°) also yields accurate information about the center frequency of the signal. The
masses at 0° and 180° appear at twice the center frequency. Bandwidth is another parameter
that is extractable from these graphs and is determined by measuring the strongest portion
of the data. As resolution is improved (smaller numbers are best), the estimates from the
plots become more accurate. The exact parameters of the signals under investigation are
known, which makes data comparison and extraction easy. In real life, this can be
troublesome because the exact signal parameters may not be known.

The code rate is one parameter that is unique to cyclostationary algorithms. The code rate
(R_c) leads to other parameters, depending on the signal of interest. For FMCW signals,
R_c = 1/(2t_m). For Frank code signals, R_c = 1/(N_c t_b), where N_c is the number of
subcodes and t_b is the subcode period. The next two sections review several plots to help
the reader become familiar with the effects of M on the ability to measure R_c, f_c, and B
for a selected frequency resolution (Δf).
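
As a quick numerical check of the two code-rate relations, the sketch below evaluates them for parameter values matching the example signals used in this chapter (t_m = 20 ms for the FMCW signal; 64 subcodes, 1 cycle per phase, and a 1 kHz carrier for the Frank signal). Two assumptions are made here and are not stated in the text: the FMCW relation is read as R_c = 1/(2 t_m), and the Frank subcode period is taken as t_b = cpp / f_c. All function names are hypothetical.

```python
def fmcw_code_rate(t_m_s):
    """Code rate of an FMCW signal, read here as R_c = 1 / (2 * t_m)."""
    return 1.0 / (2.0 * t_m_s)


def frank_subcode_period(cpp, f_c_hz):
    """Assumed subcode period: cpp carrier cycles per subcode (t_b = cpp / f_c)."""
    return cpp / f_c_hz


def frank_code_rate(n_c, t_b_s):
    """Code rate of a Frank code signal, R_c = 1 / (N_c * t_b)."""
    return 1.0 / (n_c * t_b_s)


if __name__ == "__main__":
    print("FMCW  R_c (Hz):", fmcw_code_rate(0.020))
    t_b = frank_subcode_period(cpp=1, f_c_hz=1000.0)
    print("Frank R_c (Hz):", frank_code_rate(64, t_b))
```

Both code rates come out at a few tens of hertz, which gives a sense of how fine the cycle-frequency resolution must be before R_c can be read from a plot.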


a. FMCW 


FMCW signals have three directly measurable parameters when viewing the cycle frequency vs.
frequency plots. For a given FMCW signal, the observer should be able to extract modulation
bandwidth (ΔF), center frequency (f_c), and code rate (R_c). The first signal under
investigation has the following parameters, as created by the LPIT toolbox found in [1]:


Table 3. FMCW Parameters

Signal Name                 | F_1_7_250_20_s.mat
Signal Type                 | FMCW
Center Frequency (f_c)      | 1 kHz
Sampling Frequency (f_s)    | 7 kHz
Modulation Bandwidth (ΔF)   | 250 Hz
Modulation Period (t_m)     | 20 ms
Noise Added                 | None (signal only)























Figures 5 through 17 show that for this FMCW signal, a minimum of Δf = 64 Hz and M = 8
should be used to measure all desirable parameters. This combination also sets N to 1024 or
greater. The examples in this thesis suggest that an N of 1024 or greater yields the best
resolution combinations.
Figure 5. FMCW signal default view. Shows four lobes.
Figure 7.
Figure 9.
Figure 10.

[Plots: time-smoothing SCD of the FMCW signal, df = 64, N = 1024, wide view and close-up.
Axes: cycle frequency (Hz) vs. frequency (Hz).]

Figure 11. Δf = 64, M = 8. Able to measure all parameters.
Figure 12.

[Plot: time-smoothing SCD of the FMCW signal, df = 32, N = 1024.]

Figure 13.
Figure 14.
Figure 16.

[Plot: time-smoothing SCD of the FMCW signal, df = 8, N = 2048. Axes: cycle frequency (Hz)
vs. frequency (Hz).]

Figure 17. Δf = 8, M = 2. Able to extract all parameters.


b. Frank Code Signal 


The Frank code signal is analyzed in the same fashion as the FMCW signal, by choosing Δf
and M and attempting to detect the various parameters. For the Frank code, the observer
should be able to extract bandwidth (B), center frequency (f_c), and code rate (R_c). The
number of subcodes (N_c) can be calculated because R_c and B are measurable, as described
by the equation:

\[ N_c = \frac{B}{R_c} \tag{1.4} \]

Table 4. Frank Parameters

Signal Name                   | FR_1_7_8_1_s.mat
Signal Type                   | Frank
Center Frequency (f_c)        | 1 kHz
Sampling Frequency (f_s)      | 7 kHz
Number of Phase Codes (N_c)   | 8² = 64
Cycles per phase (cpp)        | 1
Noise Added                   | None (signal only)
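
Equation (1.4) can be exercised with values consistent with Table 4. The sketch below is a hedged illustration only: the "measured" bandwidth and code rate are hypothetical plot readings, chosen so that B = 1 kHz (which would follow if B = f_c / cpp, an assumption not stated in the text) and R_c = 15.625 Hz. The helper name `subcodes_from_measurements` is invented for this example.

```python
def subcodes_from_measurements(bandwidth_hz, code_rate_hz):
    """Number of subcodes N_c = B / R_c, rounded to the nearest integer."""
    return round(bandwidth_hz / code_rate_hz)


# Hypothetical readings from a frequency vs. cycle-frequency plot.
measured_bandwidth_hz = 1000.0
measured_code_rate_hz = 15.625

print("Estimated N_c:", subcodes_from_measurements(measured_bandwidth_hz,
                                                   measured_code_rate_hz))
```

Rounding is used because plot readings are limited by the resolutions Δf and Δα, so the ratio will rarely be an exact integer in practice.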
Figures 18 through 29 show that for this Frank signal, a minimum of Δf = 64 Hz and M = 8
should be used to measure all desirable parameters. This combination also sets N to 1024 or
greater. This example signal also suggests that an N of 1024 or greater yields the best
resolution combinations.
Figure 18. Frank code entire signal showing all 4 lobes.
Figure 20.
Figure 21.
Figure 23.
Figure 25. Upper plot shows how B is also measurable on the frequency axis. Not able to
measure R_c consistently.
Figure 26. Δf = 32, M = 4. Able to extract all parameters.
Figure 28. Δf = 16, M = 2. Able to extract all parameters.
Figure 29. ... 4. Able to extract all parameters.


2. Hardware Decisions


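The previous plots for the Frank and FMCW signals suggest that hardware implementations of
the cyclostationary FAM should use a Δf and M combination that yields an N of 1024 or
higher. N is described as:

\[ N = PL \tag{1.5} \]

\[ \Delta\alpha = \frac{\Delta f}{M} \tag{1.6} \]

\[ L = \operatorname{pow2}\big(\operatorname{nextpow2}(f_s/\Delta f)\big)/4 \tag{1.7} \]

\[ P = \operatorname{pow2}\big(\operatorname{nextpow2}(f_s/\Delta\alpha/L)\big) \tag{1.8} \]

For example, with f_s = 7000 Hz, Δf = 64 Hz, and M = 8, these give L = 32, P = 32, and
N = 1024.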


Using a value of N of 1024 or greater yields frequency vs. cycle-frequency plots with
resolutions that are sufficient to extract all of the parameters an operator needs to
identify and rebuild the signal of interest. Frequency resolution was set by hardware when
the sampling frequency of 7000 Hz was chosen. This sampling frequency is used for
consistency with the previous thesis [2] and the assumptions made for the MATLAB code found
in [1]. Cycle-frequency resolution is set as a result of Δf and M. This suggests that a
minimum of 1024 points is needed to get sufficient resolution in both cycle-frequency and
frequency. This allows the engineer to pick values for the two FFTs based upon the size and
speed limitations of the targeted hardware. This procedure should also be repeated for
other modulation types to help validate the usefulness of this technique with the proposed
minimum settings for Δf and M.
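
The sizing procedure in Equations (1.5) through (1.8) can be sketched as a small calculator. The sketch assumes that the division by 4 in Equation (1.7) is applied after pow2, so that L is one quarter of the first FFT size N'; with that reading the results match the N values shown in the plot titles of this chapter (N = 1024 for Δf = 64 Hz with M = 8, and N = 2048 for Δf = 8 Hz). The names `nextpow2` and `fam_sizes` are local helpers written for this example; `nextpow2` only mimics the MATLAB function of the same name for arguments of 1 or more.

```python
import math


def nextpow2(x):
    """Smallest integer p with 2**p >= x, for x >= 1."""
    if x < 1:
        raise ValueError("this sketch only handles x >= 1")
    return (math.ceil(x) - 1).bit_length()


def fam_sizes(fs_hz, delta_f_hz, m):
    """Return the FAM sizes implied by a sampling rate, delta_f, and M."""
    delta_alpha = delta_f_hz / m                      # Equation (1.6)
    n_prime = 2 ** nextpow2(fs_hz / delta_f_hz)       # first FFT size N'
    hop = n_prime // 4                                # Equation (1.7), L = N'/4
    p = 2 ** nextpow2(fs_hz / delta_alpha / hop)      # Equation (1.8)
    return {"N_prime": n_prime, "L": hop, "P": p, "N": p * hop}  # Equation (1.5)


if __name__ == "__main__":
    for delta_f, m in ((64, 8), (32, 4), (16, 2), (8, 2)):
        print(f"df={delta_f} Hz, M={m}:", fam_sizes(7000, delta_f, m))
```

A calculator like this lets a designer check, before committing FPGA resources, whether a candidate Δf and M pair reaches the N of 1024 suggested above.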


C. NOISE EFFECTS


The cyclostationary FAM algorithm is noise resistant. Several examples below help validate
this claim. Each signal is generated using LPIT from [1]. Figures 30 through 33 are FMCW
signals with signal-to-noise ratios of 0, -3, -6, and -10 dB, respectively. Careful
observation by a trained operator suggests that all of the parameters are extractable. In
the -6 and -10 dB plots, extraction is difficult but still possible.
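
To put these noise levels in perspective, the sketch below converts a signal-to-noise ratio in decibels into the noise power relative to a unit-power signal. It assumes the quoted levels are power SNRs; `noise_power` is a hypothetical helper and is not part of LPIT.

```python
def noise_power(signal_power, snr_db):
    """Noise power that gives the requested SNR (dB) for a given signal power."""
    return signal_power / (10 ** (snr_db / 10))


for snr_db in (0, -3, -6, -10):
    print(f"SNR {snr_db:>3} dB -> noise power = {noise_power(1.0, snr_db):.2f} x signal power")
```

At -6 dB the noise carries roughly four times the power of the signal, and at -10 dB ten times, which helps explain why the code rate becomes hard to read at the lowest level.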
Figure 33. Signal with -10 dB noise. R_c is difficult to extract accurately.


D. SUMMARY 


The use of LPI signals and radar is on the rise throughout the world. Knowing this, the
Naval Postgraduate School is working to develop an autonomous system to reduce the decision
time for classification and extraction of parameters from signals of interest.


The time-smoothing FFT accumulation method was chosen for this thesis based upon previous
work. As part of the optimization of the algorithm previously developed in [2], this
research found that Grenander’s Uncertainty Condition helps decide the minimum hardware
needed to create the desired resolutions in both the frequency and cycle-frequency domains.
The minimum product of the two FFT sizes is 1024.


The robustness of the cyclostationary algorithm was validated by examining how noise
affects a signal of interest using MATLAB code. Signals of interest are shown, visually, to
be noise tolerant down to a signal-to-noise ratio of -6 dB.


Examining the signals of interest was an important step during the initial research stage,
allowing important decisions to be made with respect to hardware choices. Software
decisions are also influenced by knowledge of the cyclostationary FAM algorithm.
III. RECONFIGURABLE COMPUTING


A. SRC-6 


The SRC-6 is a reconfigurable computer designed by SRC Computers, LLC in Colorado Springs,
Colorado. SRC developed a hardware and software architecture they call the
IMPLICIT+EXPLICIT Architecture™, which fully integrates Dense Logic Device (DLD) technology
and reconfigurable Direct Execution Logic (DEL). Systems built with this architecture
execute user code, written in high-level languages such as C or FORTRAN, on a mixture of
tightly coupled implicitly and explicitly controlled processors [10]. Figure 34 describes
the high-level differences between the implicit and explicit devices. The implicitly
controlled device is a DLD that operates at a higher clock rate and is made up of fixed
logic. In the case of the NPS SRC, it is an Intel microprocessor (µP). The explicitly
controlled device is DEL hardware that runs at a lower clock rate (100 MHz) and is a
reconfigurable FPGA.


The DLD for the SRC is the Intel IA-32 line of microprocessors. The SNAP™ interface bridges
the µP to the MAP at 1400 MB/sec. This interface is one of the bottlenecks of the system.
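
A rough sense of why the interface matters can be had by expressing an ideal transfer time in MAP clock cycles. The sketch below is a back-of-the-envelope estimate only: it assumes 1 MB = 10^6 bytes, 8 bytes per complex sample, and the peak 1400 MB/sec rate with no setup latency, none of which is specified in the text. The name `transfer_cycles` is hypothetical.

```python
SNAP_BYTES_PER_S = 1400e6   # 1400 MB/sec, taking 1 MB = 10**6 bytes (assumption)
MAP_CLOCK_HZ = 100e6        # DEL clock rate quoted in the text


def transfer_cycles(n_samples, bytes_per_sample=8):
    """Ideal number of MAP clock cycles needed to move n_samples over SNAP."""
    seconds = n_samples * bytes_per_sample / SNAP_BYTES_PER_S
    return seconds * MAP_CLOCK_HZ


print(f"1024 complex samples: about {transfer_cycles(1024):.0f} MAP cycles at peak rate")
```

Even at the peak rate, moving a block of data costs a noticeable number of cycles relative to a pipelined loop that retires one result per clock, so avoiding unnecessary transfers is worthwhile.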


[Figure 34 diagram: Implicit Controlled Device, a dense logic device (DLD), alongside
Explicit Controlled Device, direct execution logic (DEL).]


Figure 34. Implicit + Explicit Architecture (After: [10])


To aid the user in programming the SRC-6, SRC has developed the Carte™ programming
environment. The Naval Postgraduate School is using Carte version 2.2 as of this writing.
Carte is further covered in Chapter III Section C.


At the heart of this system is the MAP®, which is the SRC high-performance Direct
Execution Logic processor. The MAP, an acronym for Multi-Adaptive Processor, houses two
user logic areas, associated memory, and a control processor. There are several ports to
the “outside world” allowing for interconnection with other MAPs. Figure 35 shows the
various data rates and components of a typical MAP. Appendix A compares the different
variants of MAPs compatible with the SRC-6 architecture.


Figure 35. MAP E Overview


The SRC-6 was chosen because it is expandable, flexible, and available for use at the Naval
Postgraduate School. Learning to utilize the FPGA logic also helps when porting this
project to other (newer) FPGAs. The unique SRC architecture has tremendous possibilities
for a realizable, upgradable system, as shown in Figure 1, built by interconnecting several
MAPs, connecting radars as input peripherals, and connecting monitors, or possibly ship or
aircraft early warning systems, as output peripherals.


B. HARDWARE 


With every design project, the hardware is analyzed so that the software may be optimized.
This project is no different. The FPGAs within the E-Series MAP were chosen over the other
available MAPs in the system because they have more available memory (BRAM) and logic
slices. This project would perform better on the newer F, G, or H series MAPs, which have a
faster clock, more slices, and/or built-in floating point hardware. Working with what is
available, the software is targeted to the E-Series MAP Xilinx Virtex-II Pro xc2vp100
FPGAs.


1. Xilinx Virtex-II Pro xc2vp100-ff1696-5


The model of FPGA used in the E-Series MAP is the xc2vp100-ff1696-5. The 
size of each FPGA is 42.5 mm x 42.5 mm [11]. The speed grade of these particular chips 
is -5 (indicating a 300 MHz PowerPC Processor Block) [12]. Quick reference 
information about this FPGA is found in Table 5. A visual representation of the 
architecture is seen in Figure 36. 


Two major components make up the FPGA. The input/output blocks (IOBs) 
provide the interface between package pins and the internal configurable logic [12]. 
The internal configurable logic, built around configurable logic blocks (CLBs), is used 
to implement the configurable network of logic. It has four major components: 
• Combinatorial and synchronous logic 
• Block SelectRAM (dual-port RAM) 
• Dedicated multipliers (18-bit x 18-bit) 
• Digital Clock Manager (DCM) 


All the components are able to connect via the General Routing Matrix (GRM). 
The GRM connects the programmable elements together during compilation and allows 
for fast reprogramming of an FPGA. 
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Figure 36. Virtex II Architecture Overview (From: [12]) 
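Table 5. Virtex-II Pro XC2VP100 quick information (After: [12]) 
Device | Logic Cells | Slices | Max Distr RAM (Kb) | 18 x 18 Bit Multiplier Blocks | 18 Kb Blocks | Max Block RAM (Kb) 
XC2VP100 | 99,216 | 44,096 | 1,378 | 444 | 444 | 7,992 
Slices and Max Distr RAM are CLB resources (1 CLB = 4 slices = max 128 bits); 18 Kb Blocks and Max Block RAM are Block SelectRAM+ resources. 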
2. Memory 


Available memory on each MAP comes in three types: On Board Memory (OBM), Block 
RAM (BRAM), and global (common) memory. OBM and BRAM are fixed in size. Global 
memory may be upgraded by purchasing more memory and placing it into available slots. 
Global memory is the hardest to use and the most costly in terms of latency. Memory 
utilization and initialization are done by macros and are covered in a later section. 
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a. On Board Memory (OBM) 


Memory onboard the MAP is limited to six banks of 523,776 64-bit 
words. Each bank also has 512 words available to hold scalars. The banks are named A, 
B, C, D, E, and F. The size of every array must be known at compile time [10]. 
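The figures above can be checked with a few lines of plain C. This is only a sketch of the arithmetic: the constant and function names are illustrative and not part of Carte, and it assumes the 512 scalar words sit alongside the 523,776 array words in each bank, which gives the 4 MB per bank and 24 MB total shown later in Figure 38.

```c
#include <stdint.h>

/* Figures quoted in the text; all names here are illustrative only. */
enum {
    OBM_BANKS        = 6,
    OBM_ARRAY_WORDS  = 523776, /* 64-bit words usable for arrays, per bank */
    OBM_SCALAR_WORDS = 512,    /* 64-bit words reserved for scalars, per bank */
    OBM_WORD_BYTES   = 8
};

/* Total 64-bit words in one bank (array space plus scalar space). */
static uint64_t obm_bank_words(void)
{
    return (uint64_t)OBM_ARRAY_WORDS + OBM_SCALAR_WORDS;
}

/* Size of one bank in bytes. */
static uint64_t obm_bank_bytes(void)
{
    return obm_bank_words() * OBM_WORD_BYTES;
}

/* Size of all six banks in bytes. */
static uint64_t obm_total_bytes(void)
{
    return OBM_BANKS * obm_bank_bytes();
}

/* Returns 1 if an array of n 64-bit elements fits in one bank, else 0. */
static int fits_in_bank(uint64_t n_words)
{
    return n_words <= OBM_ARRAY_WORDS;
}
```

Because array sizes must be known at compile time, a check such as fits_in_bank can be applied to every planned array before any MAP code is written.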


b. Block RAM (BRAM) 


BRAM is available to the programmer in the quantity of 444 units, 
each containing 2048 bytes. The size of a BRAM allocation is the smallest table value 
that is able to contain the requested size [10]. BRAM is fast, dual-ported memory. 
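As a rough planning aid, the BRAM budget can be estimated in plain C. The sketch below assumes a simple round-up to whole 2048-byte units; the actual allocation table in [10] may round to a larger value, so the result is a lower bound. The names are illustrative and not part of Carte.

```c
#include <stddef.h>

/* Figures quoted in the text; names are illustrative only. */
enum { BRAM_UNITS = 444, BRAM_UNIT_BYTES = 2048 };

/* Total BRAM capacity of one FPGA in bytes. */
static size_t bram_total_bytes(void)
{
    return (size_t)BRAM_UNITS * BRAM_UNIT_BYTES;
}

/* Lower bound on the number of BRAM units needed to hold 'bytes' bytes.
   Assumes plain round-up; the real allocation table may use more. */
static size_t bram_units_lower_bound(size_t bytes)
{
    return (bytes + BRAM_UNIT_BYTES - 1) / BRAM_UNIT_BYTES;
}
```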


C. SOFTWARE 


Programming an SRC-6 reconfigurable computer is not for the novice with a 
project due this coming weekend. The system requires an understanding of several 
programming languages and an in-depth understanding of the hardware to take advantage 
of the architecture¹. The source files are compiled using the SRC 
CARTE™ programming environment. The files of a project are the subject of this 
section. 


Complex and custom circuitry is one advantage of designing with this unique 
architecture. Such circuitry is programmed in a hardware description language such as 
Verilog or VHDL. Code for custom macros and the microprocessor can also be written 
in C or Fortran. Custom code should be targeted for the MAP, where the 
programmer can take advantage of the unique capabilities of reprogrammable logic. 
Creating a unique combination of custom circuitry and code is where the art lies in 
utilizing this system. No one-stop programming book exists that describes how to approach 
designing code for the SRC-6. The best recommendation is to experiment and gain 
knowledge with the system. Develop as much of the bulk of the program as you can in 
the language you are most comfortable with (ANSI C or Fortran). 


The compiler does a decent job converting your code into circuitry to implement either 


¹ It is possible to make a complete program using only one programming language. To take full 
advantage of the FPGAs, however, precise circuit design should be employed using Verilog or VHDL. 
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on the FPGA or on the microcontroller. Use Verilog to create custom macros for circuits 
that are difficult to program in C or that can take advantage of hardware implemented on the 
MAP but not on the uP. 


To best utilize the FPGAs, move any code that can be designed as custom 
hardware into a macro. As mentioned earlier, another technique developed for this thesis is 
to avoid costly back-and-forth memory transfers. Also move adjoining functions that 
logically follow in the processing sequence. Build an assembly line of operations that 
need to be performed on multiple pieces of data. This process has no benefit if 
operations are performed on only a few pieces of data, because there is a latency 
penalty for downloading data to and uploading data from an FPGA. The compiler will 
attempt to pipeline as much as it can, which increases data throughput. This is why it 
is best to perform the chosen operations on many pieces of data. 


Data operations are not limited to logic gates; the pipelines can take advantage of 
successive mathematical operations, as is done in the code of this thesis. The Fast Fourier 
Transform is one operation that is full of different mathematical computations. Applying 
Hamming windows, down-converting, and multiplication are all operations in this thesis 
that also lend themselves to being pipelined. Moving them to the FPGA made sense. 


Throughout the development of the code for this and other projects, the most 
efficient method for designing custom macros has been to develop the Verilog code in 
a program such as Xilinx ISE or Active-HDL. These programs have the added benefit of 
testing and simulating the code. With the appropriate versions of the software, it is 
even possible to see how the code should perform on the targeted system. Xilinx ISE is 
especially helpful because it utilizes the same software that the SRC-6 uses for compiling 
the executable code. 


The recommendation of this thesis is to design custom circuitry for operations that 
have no simple function call on the microprocessor. In addition, any series of 
computations that lends itself to a pipeline operation is well suited to the FPGA. To 
fully exercise the FPGAs and conceptualize the larger system of dedicating one MAP for 
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Cyclostationary Processing the entire algorithm was targeted for the two FPGAs. At the 
end of the project, only a fraction of the post-data processing was moved back to the 
microprocessor because of size restrictions. 


CARTE is best thought of as a complex compiler that takes all the possible types 
of input files and optimizes and links them together to create an application executable. 
To create the application executable, the linker brings together the object files 
(‘.o’ files) produced by the MAP compiler and the uProcessor compiler. CARTE runs on 
Fedora Linux. 


The MAP compilation process is best visualized by Figure 37. The steps of 
Parsing, CFG/DFG Generation, Optimizations, HDL Generation, and Place and Route are 
completed prior to creating the Unified Executable. HLL source files written in C, MAP 
macros, customer macros, and the runtime library comprise the different types of files 
that come together during compilation. This thesis utilizes all of these file types. For the 
beginning user, it is simpler and more efficient to maximize the use of the runtime library 
and MAP macros. These files are already optimized and provide easy interfaces 
and ample documentation for implementation. Another popular option is to 
import intellectual property (IP) that has been created for the targeted hardware. Xilinx 
has a robust library that was considered for this thesis². 


² Consult Chapter V for a discussion of the possible use of IP for FFT implementation. 
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Figure 37. MAP Compilation process (From: [13]) 


D. SUMMARY 


The SRC-6 is a reconfigurable computer comprised of two user-programmable 
areas (the microprocessor and the FPGAs). This flexible computing system allows for the 
creation of unique, problem-specific code that can implement hardware not found in the 
microprocessor. This software and hardware design platform has a steep learning curve 
that can be overcome via training and practice. Code for the SRC-6 is optimized 
by adhering to techniques that reduce memory dependencies and take 
advantage of pipelining. 
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IV. SRC SPECIFIC TECHNIQUES 


This chapter outlines a few lessons that helped reduce overall latency. In 
summary, the simple introduction of another variable reduces latency at the cost of 
another array. Common C programming practices often end up having a negative 
effect when utilized on the SRC-6; arrays are one example. Reviewing the technical 
documents [10] and [13], and working one chapter at a time from both to learn about the 
unique techniques, will save a lot of time during the debugging phase. The overall latency 
of a program can be significantly reduced if one adheres to these techniques. 


A. MEMORY IMPLEMENTATION 


Recall from Chapter III that several types of memory are available to the 
programmer. This section covers a few of the limitations in dealing 
with the different types of memory. 


1. On Board Memory 


OBM is arranged into six banks. Each bank has a unique name (A, B, C, D, E, 
and F). Memory can only be accessed in 64b (64-bit) words. Using the macros split_X_Y 
and combine_X_Y, a programmer can pack multiple data words into one memory 
address if the data elements are smaller than 64b. Here X indicates the reduction parameter 
(64to4, 64to32, etc.) and Y is the data type combination (flt_flt, flt_int, etc.). It is very 
important to utilize memory efficiently because the maximum OBM size is 523776 64b 
words per bank. 
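The packing idea can be illustrated in plain C. The sketch below is a host-side analogue only: pack_flt_flt and unpack_flt_flt are hypothetical helper names, not the Carte split and combine macros, and the choice of which float goes in the upper half is an assumption. It also assumes a 32-bit float.

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit floats into one 64-bit word.
   'hi' occupies the upper 32 bits and 'lo' the lower 32 bits. */
static uint64_t pack_flt_flt(float hi, float lo)
{
    uint32_t hbits, lbits;
    memcpy(&hbits, &hi, sizeof hbits);   /* copy the raw bit pattern */
    memcpy(&lbits, &lo, sizeof lbits);
    return ((uint64_t)hbits << 32) | lbits;
}

/* Recover the two floats from a packed 64-bit word. */
static void unpack_flt_flt(uint64_t word, float *hi, float *lo)
{
    uint32_t hbits = (uint32_t)(word >> 32);
    uint32_t lbits = (uint32_t)(word & 0xFFFFFFFFu);
    memcpy(hi, &hbits, sizeof *hi);
    memcpy(lo, &lbits, sizeof *lo);
}
```

Storing, for example, the in-phase and quadrature samples of one point in a single word halves the number of OBM words an array of floats would otherwise occupy.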


Declaring OBM memory is done at the top of a MAP routine. There are only 
three predefined data structures that can be utilized with OBM. One can declare an 
array, a 2d array, or two arrays in the same bank. See Section C, Arrays, for a technique 
to create as many arrays as desired in a single OBM bank. Scalar variables are better left 
to BRAM. 

OBM_BANK_A (IN, double, 100) 
OBM_BANK_B_2D (OUT, double, 30, 20) 
OBM_BANK_C_2_ARRAYS (A, double, 1000, B, int64_t, 900) 
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The first declares an array named IN made up of 100 elements with data type 
double in OBM bank A. The second declares a 30 x 20 two-dimensional array named OUT 
in bank B of data type double. The third declares two arrays in the same bank: A is an 
array of 1000 doubles, and B is an array of 900 elements of data type int64_t (64-bit 
integer). 


Moving data into and out of an OBM (to or from the microprocessor) requires 
two specific macros. DMA_CPU moves the data either to or from OBM, and 
wait_DMA is a stall function that makes the program wait until the data is transferred. 


DMA_CPU (<direction>, <OBM adr>, <OBM stripe>, <CM adr>, <CM stride>, 
<length>, <server>); 

wait_DMA (<server>); 

<direction>   Either CM2OBM or OBM2CM; specifies the direction of the data transfer. 
<OBM adr>     Start address in OBM. 
<OBM stripe>  Stripe/stride address. Use MAP_OBM_stripe (1,"X"), where X is the 
              combination of banks being striped across. 
<CM adr>      CPU address. 
<CM stride>   Stride for the CPU address. 
<length>      Transfer length in bytes. 
<server>      Server number in the range zero to eleven. 





2. Block RAM 


This type of local memory is both flexible and fast. Existing within the FPGA, 
it is the preferred type of memory for variables and small arrays. BRAM can hold 
any data type that is standard within the C language. Most often used in this thesis 
are integer (int), float (32-bit type), and double. As previously mentioned, there is a 
capacity of 444 BRAM units, each consisting of 2048 bytes. Declaring BRAM 
variables is as simple as declaring variables in C. Here are a few examples: 

float IN;          //declares a 32-bit floating point variable named IN 
double OUT1[100];  //declares an array of 100 doubles named OUT1 
int i,j,k;         //declares three variables for loops 
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B. LOOPS & DEPENDENCIES 
1. Loop-Carried Memory Dependency 


Loop-carried memory dependencies happen when a (for) loop writes to a 
memory location during one iteration and then reads it in the next iteration. To compensate, 
the loop is slowed down to ensure that a write in one iteration takes effect before the read 
occurs in the following iteration [10]. Table 6 shows one example of a memory 
dependency and how to fix it by introducing an extra variable to remember the previous 
value. This technique appears to be wasteful of memory, but it saves a significant 
amount of time: two clock cycles per loop iteration. 


Table 6. Memory Dependency Example (From: [10]) 

Memory Dependent: 
for (i=1; i<n; i++) { 
    temp = a[i-1]+j; 
    a[i] = temp; 
} 

No Dependencies: 
prev = a[0]; 
for (i=1; i<n; i++) { 
    temp = prev+j; 
    a[i] = temp; 
    prev = temp; 
} 
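The two forms of the loop can be compared on an ordinary CPU with the following plain C sketch. The function names and the use of long are illustrative assumptions. The sketch only demonstrates that the rewrite produces the same array contents; the two-clock saving itself applies to the pipelined loop on the MAP and is not observable here.

```c
#include <stddef.h>

/* Dependent form: each iteration reads a[i-1], which the previous
   iteration has just written. */
static void fill_dependent(long *a, size_t n, long j)
{
    for (size_t i = 1; i < n; i++) {
        long temp = a[i - 1] + j;
        a[i] = temp;
    }
}

/* Rewritten form: the previous value is carried in the scalar 'prev',
   so the loop body only writes to the array and never reads it. */
static void fill_no_dependency(long *a, size_t n, long j)
{
    if (n == 0)
        return;
    long prev = a[0];
    for (size_t i = 1; i < n; i++) {
        long temp = prev + j;
        a[i] = temp;
        prev = temp;
    }
}
```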














2. Multiple Accesses to the Same Memory Bank 


Use of OBM can be troublesome when accessing arrays. Accessing multiple 
arrays in the same bank causes the loop to slow down to allow for the reading and writing 
of values. Each extra read or write causes a penalty of two clock cycles. A loop is fully 
pipelined only if there is one memory access in the loop body; every additional 
memory access adds two clocks per iteration. Table 7 shows 
an example of a multiple access problem and how to work around it. Another way to 
work around this problem is to break up where the data is stored. If the data is stored in 
three BRAM arrays, the problem of reading the same array three times in one loop 
is avoided. 
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Table 7. Multiple accesses to same memory bank ( After: [10] ) 














Multiple accesses to same memory: 
for (i=0; i<n-2; i++) 
    b[i] = a[i] + a[i+1] + a[i+2]; 

Work around: 
for (i=0; i<n; i++) { 
    a0 = a1; 
    a1 = a2; 
    a2 = a[i]; 
    if (i>=2) 
        b[i-2] = a0+a1+a2; 
} 
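The sliding-window workaround can be exercised in plain C as follows. This is a sketch with illustrative function names; it assumes the intended computation is the sum of three consecutive elements, and it only shows that one read of a per iteration yields the same output as three reads.

```c
/* Direct form: three reads of 'a' in every iteration. */
static void sum3_multi_access(const double *a, double *b, int n)
{
    for (int i = 0; i < n - 2; i++)
        b[i] = a[i] + a[i + 1] + a[i + 2];
}

/* Sliding-window form: one read of 'a' per iteration. The two older
   values are kept in scalars and shifted along on every pass. */
static void sum3_sliding(const double *a, double *b, int n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0;
    for (int i = 0; i < n; i++) {
        a0 = a1;
        a1 = a2;
        a2 = a[i];
        if (i >= 2)                 /* window is full from the third element on */
            b[i - 2] = a0 + a1 + a2;
    }
}
```

The guard on i delays the first write until three values have been read, which is why the output index lags the input index by two.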














C. ARRAYS 


Program writing can become difficult when the number of needed arrays becomes 
greater than twice the number of OBM banks, especially when one is trying to apply 
techniques such as those found in Chapter IV, Section B. Creating multiple 2-D arrays in 
a single OBM bank is not defined by Carte, and doing so requires the use of pointers. The 
method is demonstrated in Table 8, which shows how several two-dimensional arrays can 
be implemented in one OBM bank. This technique is difficult to manage but can be very 
helpful when several two-dimensional arrays are needed and the programmer has run out of 
OBM banks in which to store data individually. Again, be cautious of creating multiple 
accesses to the same memory bank. 


Table 8. Multiple 2d Arrays in one OBM (From: [14]) 

OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE) 
int array1, array2, offset; 

array1 = 0; 
array2 = 200000; 

for (i=0; i<nrow; i++) { 
    for (j=0; j<ncol; j++) { 
        offset = i*ncol + j; 
        value1 = AL[array1 + offset]; 
    } 
} 

for (i=0; i<nrow; i++) { 
    for (j=0; j<ncol; j++) { 
        offset = i*ncol + j; 
        value2 = AL[array2 + offset]; 
    } 
} 
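The bookkeeping behind Table 8 can be sketched in plain C with a flat array standing in for one OBM bank. The names bank, idx2d, and region_ok are hypothetical, and the overlap check is an addition for illustration; nothing here is part of Carte.

```c
#include <stdint.h>
#include <stddef.h>

#define BANK_WORDS 523776          /* usable 64-bit words per bank, per the text */

static int64_t bank[BANK_WORDS];   /* stand-in for one OBM bank */

/* Word index of element (row, col) of a row-major 2-D array whose
   first element is stored at word 'base'. */
static size_t idx2d(size_t base, size_t ncol, size_t row, size_t col)
{
    return base + row * ncol + col;
}

/* Returns 1 when an nrow x ncol array starting at 'base' ends at or
   before 'next_base' and stays inside the bank, else 0. */
static int region_ok(size_t base, size_t nrow, size_t ncol, size_t next_base)
{
    size_t end = base + nrow * ncol;   /* one past the last word used */
    return end <= next_base && end <= BANK_WORDS;
}
```

With array1 at word 0 and array2 at word 200000, as in Table 8, the first array may hold at most 200000 elements before it would run into the second.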
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D. PARALLEL REGIONS 


Parallel regions operate two independent sections of code at the same time. Each 
section of code can include normal constructs such as loops, pipelined loops, and external 
macros [10]. Utilizing parallel regions makes it possible to optimize code by executing 
two independent sections at the same time. Efficiency is maximized if the 
sections take the same amount of time. For example, two independent 
for loops with the same iteration count that access independent OBM banks would 
benefit from parallel sections. It is permissible to reuse loop variable names in different 
sections of the same parallel region. Table 9 shows the construct for parallel regions. 


Table 9. Parallel Regions 

#pragma src parallel sections{ 
    #pragma src section{ 
        int i; 
        for (i=0; i<100; i++) 
            a[i]=b[i]+c[i]; 
    } 
    #pragma src section{ 
        int i; 
        for (i=0; i<100; i++) 
            e[i]=d[i]+f[i]; 
    } 
} 











E. USING TWO LOGIC CHIPS 


This thesis is based upon the concept of expanding the previous code found in [2] 
onto two logic chips to reduce overall latency. The utilization of two logic chips adds 
new restrictions to what can be done but has the overarching positive effect of doubling 
the amount of logic and BRAM available to the programmer. 

The logic must be broken into two separate routines. The decision of where to 
separate the logic is a difficult one and must be made individually for each situation. 
For this thesis, the code was broken at a place where the load was approximately half 
and where the mathematical computations were easily separated because the current 
computation was complete. 




One of the two subroutines must be designated for the “primary” chip, and the 
other will be downloaded to the secondary chip. Certain restrictions apply to using two 
logic chips; they are summarized in Table 10. It is easiest to 
remember that the primary chip has control of all memory functions. 

The scheme developed for this thesis, which saves an enormous amount of time in 
DMA transfers, is to utilize the OBMs for information that is needed by the other chip 
and to use BRAM for intermediate calculations. Using BRAM in between also helps keep 
the latency low, because loop dependencies are avoided. The call to pass permissions takes 
only a few clock cycles, whereas the time for a data transfer grows with the number of 
bytes moved. Using the permission passing technique also allows for maximum overall 
pipelining, because the pipeline is stopped only for the few clock cycles needed to 
pass memory permissions and not for a data transfer. 


Table 10. Dual Chip Restrictions (From: [10]) 

















Primary Routine: 
• Issues all DMAs 
• Initially controls all accesses to OBM 
• Issues ONLY send_perms 

Secondary Routine: 
• Has no access to MAP calling parameters 
• Cannot issue DMAs 
• Issues only recv_perms 











To allow for synchronization between the two FPGAs during runtime, there are 
three 64b ports (named A, B, and C) that allow one word to be passed on every 
clock cycle. These ports are best utilized to synchronize the two chips, to pass loop 
increment variables, or to update constants. Synchronization can be initiated from 
either routine and is implemented by the following macro calls: 

send_to_bridge<x> (<send args>); //on primary 
recv_from_bridge<x> (<receive args>); //on secondary 

<x> is either A, B, or C 
<send args> must be a 64b scalar value 
<receive args> must be a pointer to a 64b scalar value 
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Memory permissions have to be managed. When the MAP starts up, all memory 
permissions reside on the primary chip and must be passed to the secondary chip prior to 
the use of an OBM bank on the secondary chip. While the secondary chip holds the 
permissions, the primary chip is unable to access those OBM banks. When all 
routines on the secondary chip are complete, memory permissions should be returned to 
the primary chip. The primary chip is the only chip that has the ability to DMA to and 
from the microprocessor. A simple technique for sending the permissions is described 
in [10]. Create a mask of the permissions to be sent by ORing the 
names of the banks to be sent. Pass the mask as an argument in the call to 
send_perms. To remove permissions from the secondary chip, set the mask equal to zero. 
The primary chip controls memory permissions. For an example, see Table 11. 


Table 11. OBM Permission Example 

Send permissions: 
mask = OBM_A | OBM_D; 
send_perms (mask); 

Remove permissions: 
mask = 0; 
send_perms (mask); 
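A mask built by ORing bank names relies on the bitwise OR operator. The plain C sketch below uses hypothetical one-bit-per-bank flags (the DEMO_BANK_x values are assumptions and are not the Carte constants) to show why the logical OR operator would not work: it collapses any non-zero operands to the single value 1.

```c
#include <stdint.h>

/* Hypothetical one-bit-per-bank flags; the real Carte values may differ. */
enum {
    DEMO_BANK_A = 1 << 0,
    DEMO_BANK_B = 1 << 1,
    DEMO_BANK_C = 1 << 2,
    DEMO_BANK_D = 1 << 3,
    DEMO_BANK_E = 1 << 4,
    DEMO_BANK_F = 1 << 5
};

/* Bitwise OR keeps one bit per selected bank. */
static uint32_t mask_bitwise(void)
{
    return DEMO_BANK_A | DEMO_BANK_D;
}

/* Logical OR only answers "is either operand non-zero?" and yields 1. */
static uint32_t mask_logical(void)
{
    return DEMO_BANK_A || DEMO_BANK_D;
}

/* Returns 1 if 'bank' is present in 'mask', else 0. */
static int has_perm(uint32_t mask, uint32_t bank)
{
    return (mask & bank) != 0;
}
```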











Load balancing of the two subroutines is important because when the MAP is 
started, both chips begin to execute code at the top of their subroutines. Although 
not critical, it is best to take advantage of this “idle” time by performing any 
precalculations, array initialization, or memory management. One example 
is to precalculate all sine and cosine values needed for an FFT and store them in one 
two-dimensional array. Sine and cosine are computationally expensive, and using a FOR 
loop to calculate all needed values prevents multiple implementations of the sine/cosine 
hardware. 


F. SPACE SAVING TIP 


One way to save space on the FPGA is to become aware of functions that take up 
a lot of space. One such item is the sine and cosine module. To prevent multiple 
implementations, create an array of the sine and cosine values that are needed, then 
index into the array and read the value needed. Another learning point is that for the cost of 
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calculating the sine, the cosine is free. Make a two-dimensional array with these values. 
Another way to save even more space is to create the sine/cosine array on the uP and 
download it by DMA to the MAP. Remember to pass memory permissions to the second 
FPGA when the values are needed on the second chip. The example below shows how to 
generate the sine/cosine two-dimensional array. A variable named pi2 was created as 
another space-saving measure that prevents multiple calculations of π/2. 

//build sin_cos array - used to save sine and cosine resources 
for (index=1; index<=4; index++) { 
    SIN_Array[index] = sinf(pi2/(1<<index)); 
    COS_Array[index] = cosf(pi2/(1<<index)); 
} 
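The lookup side of this tip can be sketched in plain C. The table below holds precomputed constants, as if it had been generated on the uP and downloaded. It assumes that pi2 denotes 2π, so that row s holds the sine and cosine of 2π/2^s; that reading matches the twiddle angle in the FFT pseudo code but is an assumption, as are the names SIN_COS, sin_lookup, and cos_lookup.

```c
#define TABLE_STAGES 4

/* Row s holds {sin, cos} of the angle 2*pi / 2^s. Row 0 is unused,
   mirroring the generation loop, which starts at index 1. */
static const float SIN_COS[TABLE_STAGES + 1][2] = {
    {0.0f,        0.0f},         /* unused                */
    {0.0f,       -1.0f},         /* s = 1, angle = pi     */
    {1.0f,        0.0f},         /* s = 2, angle = pi/2   */
    {0.70710678f, 0.70710678f},  /* s = 3, angle = pi/4   */
    {0.38268343f, 0.92387953f}   /* s = 4, angle = pi/8   */
};

/* Read the stored sine for stage s (1 <= s <= TABLE_STAGES). */
static float sin_lookup(int s) { return SIN_COS[s][0]; }

/* Read the stored cosine for stage s (1 <= s <= TABLE_STAGES). */
static float cos_lookup(int s) { return SIN_COS[s][1]; }
```

Each use of sine or cosine in the pipeline then becomes an array read, so no additional sine/cosine hardware is instantiated.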




Another way to save space is to pay special attention to format conversions. 
Format conversions come from implicit casts from one data type to another. One 
cast that is particularly expensive is the cast from float to double or from double to 
float. It can be prevented by ensuring that both variables are of the same type. 


G. SUMMARY 


To take advantage of the SRC-6 reconfigurable computer architecture, the 
programmer should apply the techniques laid out in this chapter. Careful 
utilization of memory, avoiding loop dependencies, using parallel regions, and utilizing 
both user-programmable logic chips are all ways to optimize code for the 
SRC-6. These techniques save time and space when implementing a dual 
chip design. 
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V. DUAL CHIP DATA STREAMING DESIGN 


A. CONCEPT OF OPERATIONS 


The data path for the dual chip data streaming design is shown in Figure 38 and 
starts in the top left of the diagram on the primary chip. While data is processed along the 
path through the Channelize and Hamming steps, the secondary chip progresses 
up to the first synchronization point and then sits idle. Synchronization is done at 
several steps to make sure the flow path is maintained. Memory permissions are passed 
from the primary to the secondary chip. 


[Figure 38 block labels: Download Data; Channelize; Hamming Window; FFT-1; Pass Memory Permissions; Downconvert; OBM, 24 MB (6 banks of 4 MB); Return Memory Permissions; Return Data. The blocks are divided between the Primary Chip and the Secondary Chip.] 


Figure 38. Dual FPGA Data Streaming Design 


Once FFT2 is complete, memory permissions are passed back to the primary chip. 
The data can only be returned to the microprocessor from the primary chip, as this 
function is only available there. Time is saved in this design by avoiding 
passing data back and forth between the MAP and the microprocessor. Instead, only 
memory permissions are passed. The time to pass memory permissions stays 
constant no matter what the sizes of the data packets are. As this project grew, the 
amount of data being used also grew. Almost 0.34 seconds are expended in the original 
design by transferring data back and forth between stages. Passing the OBM permissions 
avoids this and costs only a few clock cycles in comparison. 


B. MEMORY USAGE 


A memory management strategy was developed for this project. Because memory 
is managed in banks, the maximum size of a memory structure is limited to 523776 64b 
words in each OBM bank. 


1. Onboard Memory (OBM) Management 


OBM usage required planning to prevent running out of memory or overwriting 
a location with new data before the old data is utilized. To manage the data, a 
memory usage plan was developed. OBM was used for any array that 
needed to be passed to the other chip. Incoming and outgoing data, with respect to the 
uProcessor, was also stored in OBM. Table 12 represents the OBM usage plan for this 
project. The incoming data (array al in bank E) and the outgoing data (finalI and finalQ 
in banks A and B) are declared on the primary chip because only the primary can DMA 
with the uProcessor. The decision was made to logically split the data path in Figure 4 
after the first FFT and before the second FFT, where the multiplication is accomplished. 
After completing the multiplication, the size of the working arrays changes. This fact 
alone is the reason why the code was split after the first FFT and prior to the 
multiplication for the computation of product sequences. This also creates a balanced 
load for each chip. 

To allow the data to be utilized by both chips, the same array has to be declared in 
both routines, as is done in Table 12 in banks C and D. 
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Table 12. OBM Usage Plan 


















































Primary Chip 
OBM_BANK_A_2D (finalI, double, P, Np*Np) 
OBM_BANK_B_2D (finalQ, double, P, Np*Np) 
OBM_BANK_C_2D (I_postFFT1, double, Np, P) 
OBM_BANK_D_2D (Q_postFFT1, double, Np, P) 
OBM_BANK_E (al, double, NN) //input data 

Secondary Chip 
OBM_BANK_A_2D (finalI, double, P, Np*Np) 
OBM_BANK_B_2D (finalQ, double, P, Np*Np) 
OBM_BANK_C_2D (I_postFFT1, double, Np, P) 
OBM_BANK_D_2D (Q_postFFT1, double, Np, P) 





























Note that there are restrictions on the data types that can be stored in OBM. Recall 
from Chapter IV that OBM can only be accessed in 64b words. Storing 32-bit floats 
therefore requires a data packing operation. Packing helps conserve memory 
because two floats can be stored in one memory location. Remember that if data is 
combined into an outbound array, pointers of the proper data type need to be declared in 
main.c. 


2. Block RAM (BRAM) 


BRAM was used for the remaining variables and arrays. BRAMs are the memory 
of choice when creating pipelined structures because perfect pipelines can be created 
without the possibility of dirty data. Dirty data is data that is written back to the same 
array or variable it was read from, creating a possibility of cross-contaminating memory. 
Keeping the data flowing through the pipeline also maintains high throughput. One of 
the goals of this thesis was to increase throughput at the expense of hardware. That 
is why extra care is taken to keep data moving and to prevent loop slowdowns and memory 
dependencies by utilizing extra BRAMs. As resolution is decreased, the need for 
memory increases, and compromises must be made with respect to BRAM and OBM 
usage that may slow down the algorithm because of the need to reuse memory. 
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C. FILE TYPES 


Previous chapters have mentioned that several types of files are required for the 
compilation process to generate the application executable. The software described 
by this thesis uses three file types that the programmer is responsible for maintaining: 
the Makefile, .c files, and .mc files. The Makefile is based on a template file found in 
[10]. SRC recommends starting with their template and customizing it, and the Makefile 
for this work started from that template. The Makefile is the instruction list that tells 
the compiler which files are to be compiled. It has a section for compiler options and 
flags that can be set by the user. The SRC documentation overlooks discussion of 
compiler options. 

Place and route, Make, and bitgen options have an enormous impact on the 
compilability of more complicated programs. The CARTE documentation [10] does not 
list these options, but they can be found in the Xilinx documentation [15]; see Table 13. 
The SRC compiler for the Virtex FPGAs uses Xilinx software for parts of the 
compilation process. 


Table 13. Effort Level Options (From: [15]) 





























Option | Function | Range | Default 
-ol overall_effort_level | Placement and routing effort level | std, med, high | std 
-pl placer_effort_level | Placement effort level (overrides the -ol value for the placer) | std, med, high | Determined by the -ol setting 
-rl router_effort_level | Routing effort level (overrides the -ol value for the router) | std, med, high | Determined by the -ol setting 
-xe extra_effort_level | Set extra effort level | normal, continue | No extra effort 








The .c and .mc file types are similar in appearance because both are written 
in ANSI C. The only difference is that the .c file is compiled for the uProcessor and the 
.mc file is compiled for the MAP. This targeting is specified at the beginning of the 
Makefile. For this thesis, three files in addition to the Makefile were written: 
dcmain.c, dcp.mc, and dcs.mc. The following code segment from the Makefile 
shows how the files are targeted for the appropriate hardware. The Makefile also 
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allows for custom designation of the name of the executable. This is set by the BIN 


variable. The variable dc is used for this thesis. 


#Files targeted to uProcessor 
FILES = dcmain.c # dual chip main file 
#Files targeted to MAP 
MAP_E_FILES = dcp.mc \ # dual chip primary chip file 
              dcs.mc \ # dual chip secondary chip file 
BIN = dc #name of executable 
# 
# Multi chip info provided here 
# Designate files for specific FPGAs 
# 
PRIMARY   = dcp.mc 
SECONDARY = dcs.mc 
CHIP2     = dcs.mc 





This thesis uses multiple FPGAs on one MAP, so a file for each FPGA is needed. 
dcp.mc is the primary file and dcs.mc is the secondary file. The flag CHIP2 must be set to 
the name of the file that is targeted for the second chip. The complete code listing is in: 

APPENDIX B: makefile 
APPENDIX C: dcmain.c 
APPENDIX D: dcp.mc 
APPENDIX E: dcs.mc 
APPENDIX F: misc files 


D. FAST FOURIER TRANSFORM 


There are two FFTs in the cyclostationary FAM algorithm. Once the data 
management plan was implemented, the majority of time was spent on the FFTs. (Chapter 
VI covers timing analysis in detail.) The second FFT is smaller in size (8 point) but 
performs more calculations. It is performed 4096 times. The FFT 
algorithm, as designed in [2], which came originally from [16], was very efficient when 
implemented as a standard C program. However, when ported to SRC-6 code, several 
memory dependencies and loop slowdowns were introduced. These slowdowns (similar 
to the topics covered in Chapter IV) resulted in 47 clocks per iteration and a pipeline 
depth of 52. When this research was started, this was identified as one place that could 
use the techniques from Chapter IV. The improvements were incremental, as shown in 
Table 14. To achieve the results of Revision 2, the FFT algorithm in Figure 39 was 
utilized as a starting point. To achieve pure pipelining, the outer loop was removed, and in 
its place lg n (three for an eight point FFT) copies of the inner loop are used. This limits 
the further usefulness of the FFT algorithm, as it is no longer scalable. This was 
acceptable because of memory limitations within the system. Having a fixed size FFT 
ensures that everything will fit within two FPGAs. Reconfiguration times are 
unacceptable for a real-time system. Recompiling all of the required files usually takes 
approximately 4 hours from start to finish. The latency calculations found in Table 14 
are representative of one complete computation of a single FFT on n points. The formula 
for latency is shown in equation 1.9 and is simply the clocks per iteration times the 
pipeline depth. For Revisions 1 and 2, this number is multiplied by three because there are 
three loops in each pipeline. Revision 2 is made up of two loops to achieve 1 clock per 
iteration. An additional pipelined loop with a depth of 5 is needed to reformat the memory 
structure utilized into the one that is expected by the main program. 

Latency = Clocks per Iteration * Pipeline Depth (1.9) 


Table 14. FFT timing improvements

                        Original    Revision 1 (3x)    Revision 2 (3x)
Clocks per Iteration    47          2                  1
Pipeline Depth          52          34                 (33+5)
Latency                 2444        204                114
ITERATIVE-FFT(a)
 1  BIT-REVERSE-COPY(a, A)
 2  n <- length[a]                  > n is a power of 2.
 3  for s <- 1 to lg n
 4      do m <- 2^s
 5         w_m <- e^(2*pi*i/m)
 6         w <- 1
 7         for j <- 0 to m/2 - 1
 8             do for k <- j to n - 1 by m
 9                    do t <- w A[k + m/2]
10                       u <- A[k]
11                       A[k] <- u + t
12                       A[k + m/2] <- u - t
13                w <- w w_m
14  return A


Figure 39. FFT pseudo code (From: [16])


1. In-place Butterfly Design


The design of this FFT, as realized by the code in Appendix B, is based on
implementing several parallel copies of a traditional butterfly design. Figure 40
shows the three-stage pipeline architecture of FFT2 as implemented by this thesis.


Figure 40. Three stage pipeline of FFT2.


Each stage in Figure 40 is separated by a BRAM of data type float. When stage 1
has finished its computation, the results are written to BRAM. Stage 2 then reads
from memory, performs its calculation, and passes the results along in the same manner.
This pipeline eliminates most of the data dependencies of the original version. Another
point to note is that all butterflies in each stage are computed in parallel, which also
reduces latency compared with other designs. Because each butterfly is implemented in
hardware, this design is fast but also utilizes the most hardware.


A second technique was applied by creating a copy of the input data to eliminate
multiple accesses to the same memory bank. This requires a small additional step of
making a copy of the data, which can be done either when the data is created or in a
small additional loop. To keep the memory dependencies down throughout the FFT,
copies of each stage are also created for the next stage. This section of code
performs two functions. To arrange the data for the first stage, the data is
rearranged to the sequence shown on the left side of Figure 40. At the same time, the
data is split into its real and imaginary parts so that the multiplications can be
done separately, because the SRC-6 does not have complex multipliers. This code
segment originated in [2] and was altered to create the a1_copy and a2_copy arrays,
which eliminate the data dependencies with minimal increase in delay. The code
segment below shows how the additional copies of data are created for use in this
thesis.


/* reorder input and split input into real and complex parts */
for (i=0; i<n; i++)
{
    /* reverse bits 0 thru k-1 in the integer "a" */
    for (ii=o=0, p = 1, q = 1<<(log2n-1);
         ii<log2n;
         ii++, p <<= 1, q >>= 1) if (i & q) o = o | p;








2. Growth of the FFT 


The FFT implemented for FFT2 is fast and minimizes memory dependencies and
loop dependencies. This FFT is still flawed in that it is not scalable on the fly.
Changing the size of the FFT requires several hours of programming and testing prior
to recompilation.

Growing this FFT requires a basic understanding of the data path shown in Figure
41. To make a larger FFT, calculate the number of required stages (S) utilizing:

S = log2(n)    (1.10)
where n = number of points of FFT

The number of stages is the number of loops needed for the FFT. Additional
variables are also needed to go between stages. The output from one stage becomes the
input to the next stage. The output of the last stage must be the input to the data flip
portion to ensure that all stages are implemented in the pipeline. The variable n at the
top of the program must be adjusted to the new n so the bit reversal section works as
designed.
[Figure 41 block diagram. Blocks: Bit Reversal, intermediate stages (Stage 2 shown),
Data Flip, Data Memory Access. Annotation: To expand for a larger FFT, replicate the
intermediate stages up to log2(n) and adjust n to increase bit reversal and data flip.]


Figure 41. Growth of FFT2


E. TIMING ANALYSIS 


The goal of this thesis was to reduce the overall latency compared to the previous
thesis [2], utilizing the same algorithm and modifying it further to take advantage of the
architecture of the SRC-6. When this research was started, there was no specific target
number in mind, other than the expectation that the extra expense in hardware should
yield a significant reduction in latency between data in and data out. This goal was
accomplished, as shown in Table 15.


The overall run time is the most important measure because it determines when the
SRC-6 can provide an output that is usable to the consumer of the data. When missiles or
airplanes are flying in toward a ship, seconds count. Having a hardware solution that can
give the user a meaningful output in less than a second is critical. Comparing 2.170836
seconds to 0.38093 seconds, one can conclude that this is a significant improvement in
the right direction. A second observation is that the amount of time that both
implementations spent on the FPGAs is practically identical. In the dual chip design,
however, the 0.25 seconds spent on the FPGAs covers both FPGAs working in parallel,
which is another example of how parallel computing can lead to efficient computing.
Another major goal of this thesis was to reduce the amount of time spent performing
DMAs. A major payoff was achieved by eliminating all but the initial download
and the return of data to the µP. A reduction from 0.3898 seconds to 0.000698 seconds
helps account for the significant reduction in run time for the processed data.


Table 15. Final timing comparison.

Upperman's Thesis

Execution times:
  2.170836 seconds total
  [...] seconds were spent on the CALLS to FFTs
  0.[...] seconds were spent on the MAP for FFTs
    Of the time spent on the MAP for FFTs:
      0.382998 seconds were spent on DMAs
      0.075691 seconds were spent in the FFT loop
  0.116433 seconds were spent on the CALLS to Channelize
  0.000157 seconds were spent on the MAP for Channelize
    Of the time spent on the MAP for Channelize:
      0.000024 seconds were spent on DMAs
      0.000133 seconds were spent channelizing
  0.106429 seconds were spent on the CALLS to Downconvert
  0.003294 seconds were spent on the MAP for Downconvert
    Of the time spent on the MAP for Downconvert:
      0.000730 seconds were spent on DMAs
      0.002564 seconds were spent downconverting
Execution time not including calls and data transfers: [...] seconds
Results from Cyclostationary FAM algorithm written to: FAM_result.txt

This Thesis

Execution times:
  0.38093 seconds total
  [...] seconds on FPAG
Of the total time:
  0.038027 seconds were spent performing FFTs
    Of the time spent on the MAP for FFTs:
      0.000958 seconds were spent on FFT1
      0.037069 seconds were spent on FFT2
  0.000149 seconds were spent on Channelize
  0.004611 seconds were spent on Downconvert
  0.006939 seconds were spent postFPGA data processing
  0.000686 seconds were spent on DMAs











F. FUTURE WORK 


Greater resolution allows for data that is more usable by the ultimate customer.
To achieve a resolution greater than or equal to the recommended N=1024, more work is
required. To reach these combinations, different FFTs are needed, as mentioned
previously. One idea not yet tested is to use Intellectual Property of previously designed
FFTs found in the Xilinx Project Navigator to implement the different FFTs that are
needed. The SRC-6 has an FFT macro that does an FFT for sizes greater than n=256.
Upperman experimented with this macro by stuffing zeros anywhere there was not a
data point, and found that it wasted a lot of time calculating the FFT for points that
were zero.
The original algorithm implemented for the FFTs has memory dependencies and
loop dependencies that were eliminated by utilizing the recommended techniques
mentioned previously in this thesis. Utilizing the technique outlined by Figure 41 yielded
a design that exceeded available slices on one FPGA. NPS is scheduled to receive a
CARTE update that may help with this problem. The update to CARTE has new
floating-point macros that should save approximately 10% in space.


Recall that the benefit of the Cyclostationary algorithm is the extractability of the 
parameter for Code Rate. The minimum resolution determined by experimentation in 
this thesis is a result of N >= 1024. Not meeting the minimum resolution makes the 
Cyclostationary Block equal in performance to the other blocks of Figure 1. Other 
algorithms are more efficient to implement and would make better algorithms for 
obtaining center frequency and bandwidth. Once new FFTs are designed, it would be 


easy to update the code in Appendices B-F for a higher resolution. 


G. SUMMARY 


The Dual Chip Data Streaming Design developed for this thesis successfully
utilizes both FPGAs on a single MAP by implementing the techniques of the
previous chapter. Overall runtime is improved to approach specifications that
make sense for implementation in a real-time system. A path of development is laid out
that can take this project to the next level by developing an SRC-6 specific FFT that has
less than 256 points. The future of the Autonomous LPI System is bright, as the lessons
learned in this thesis can be applied to the development of all modules in Figure 1.




VI. CONCLUSIONS 


The goals for this thesis have been achieved in both reducing overall latency and
implementing the Cyclostationary FAM algorithm on one MAP utilizing two FPGAs.
Achieving a run time of less than a second for a sample of data is satisfactory proof
of concept for utilizing MAPs to create the Autonomous LPI system in Figure 1. A
greater reduction in overall runtime would result from the implementation of a more time
and space efficient FFT. Another possibility is to use faster MAPs as they become
available. The new H-series MAP is clocked at 150 MHz vs. the 100 MHz of the E-series.

A path for future development has been laid out and requires that N be greater 
than 1024. This will ensure that frequency and cycle-frequency resolutions are sufficient 
to obtain the most useable data from a single iteration of the program. Chapter V 
highlighted a few methods for moving forward to achieve these resolutions by designing 
a new FFT or utilizing IP from Xilinx. FFTs are a common macro needed for several of 
the blocks of Figure 1. Further study of an SRC-6 FFT would be beneficial to the overall 
project. 

The SRC-6 is more difficult to utilize than advertised but does provide amazing
results once the programmer understands one basic concept: take advantage of what the
SRC-6 does well and avoid things (floating point, sine/cosine operations) that are costly
in both timing and space.

Another concept realized by this thesis is that overall latency can be reduced at the
cost of hardware. This trade off is necessary when more work needs to be done in the
same amount of time. Removing unnecessary data transfers by implementing a data
management plan that passes memory permissions saves time compared with moving
the data itself, because the time needed for a data transfer is directly proportional to
the amount of data being transferred.


The path forward for this project is optimistic but will be difficult. Careful
attention to resources is required, as memory will become more limited and the number
of available slices dwindles. Consideration of multi-core processors may lead to another
solution that will demonstrate timing better than that presented in this thesis.
APPENDIX A. MAP SPECIFICATIONS


[Figure 42 image: SRC MAP® specifications table. Row groups: Logic (user logic chips,
nominal clock rate, total user logic reconfiguration time), OBM bandwidth to user logic
and to control logic, sustained MAP input and output payload bandwidth, General
Purpose I/O (number of GPIO ports per MAP, signal level standards, signal paths per
port, sustainable bandwidth per port, maximum data rate per signal path), maximum
power consumption, cooling methodology, and availability. Notes: SRC reserves the
right to change these specs at any time. RL = Random Logic.]


Figure 42. MAP Processor Specifications (From: [17])
APPENDIX B. MAKEFILE


#
# Original Template from CARTE 2.2
# Modified by: LT Wesley Simon
# For : thesis
# Date : September 2009
#
FILES = dcmain.c
MAP_E_FILES = dcp.mc\
              dcs.mc
BIN = dc
#
# Multi chip info provided here
# (Leave commented out if not used)
#
PRIMARY = dcp.mc
SECONDARY = dcs.mc
CHIP2 = dcs.mc
MAPTARGET = map_e
#
# User supplied MCC and MFTN flags
#
MCCFLAGS = -log -use_par -v -keep
MFTNFLAGS = -log
#
# User supplied flags for C & Fortran compilers
#
CC = icc        # icc for Intel cc for Gnu
FC = ifort      # ifort for Intel f77 for Gnu
LD = icc        # for C codes

CFLAGS = -O3 -tpp7 -xW
MY_FFLAGS =

PAR_OPTIONS = -ol high -t 50
#
# No modifications are required below
#
MAKIN ?= $(MC_ROOT)/opt/srcci/comp/lib/AppRules.make
include $(MAKIN)

mydebug: debug

myhw: hw

myclean: clobber
	rm ari se

APPENDIX C. DCMAIN.C


//
// FILE : dcmain.c
// description: File will reside on microprocessor and is responsible for downloading
//              data to the FPGA and returning data. some post FPGA processing is done
//              along with writing to files.
// Original author: Gary Upperman
// Modified by: LT Wesley Simon
// For : thesis
// Date : September 2009
//

//Include files
#include <libmap.h>
#include <map.h>
#include <time.h>
#include <math.h>
#include "high_prec_time.c"











#define pi 3.141592653589793

















FILE *I_ptr; // pointer to the I-channel input file name 

FILE *IFFT1_Out; // pointer to the I-channel output file name 
// for the first FFT results 

FILE *QFFT1_Out; // pointer to the Q-channel output file name 
// for the first FFT results 











FILE *IFFT2_Out; // pointer to the I-channel output file name 
// for the second FFT results 
FILE *QFFT2_Out; // pointer to the Q-channel output file name 
// for the second FFT results 

































































FILE *Output; // pointer to the final data output file name








void PrimaryChip(double *, double *, double (*)[], double (*)[],
                 int64_t *, int64_t *, int64_t *, int64_t *, int64_t *, int);


//dcmain function 
int main() 


{ 








/* DECLARE VARIABLES AND CONSTANTS */ 




















/* declare file names and path */ 
char I_file[] = "I_channel.txt";
char IFFT1_Out_file[] = "IFFT1_out.txt";
char QFFT1_Out_file[] = "QFFT1_out.txt";
char IFFT2_Out_file[] = "IFFT2_out.txt";
char QFFT2_Out_file[] = "QFFT2_out.txt";
char Output_file[] = "FAM_result.txt";








/* Declare Input Variables */ 
int fs = 7000; // sample frequency
int df = 128; // frequency resolution
int M = 2; // M = df/alpha

/* Declare all timing variables and get first time hac;
   start timing */
struct timeval start1, start2, start3, temp_stop, time1;
struct timeval subr_go, subr_return, endofprogram,
               startprogram, post_dataread;
float overall_time = 0.0;
float timeonFPGA = 0.0;
float DMAtime = 0.0;
float channelize = 0.0;
float downconvert = 0.0;
float FFT1 = 0.0, FFT2 = 0.0, FFT_time = 0.0;
float datainput = 0.0;
float postFPGA = 0.0;
int64_t tchannel, tfft1, tfft2, tdownconv, tDMA;
int timei;





//INITIAL TIMING MARKER 
gettimeofday (&startprogram, NULL); 





//*****declare additional variables*******
/* calculate dalpha */
double dalpha = df/M;
/* determine number of input channels: fs/df */
double Np = pow(2.0, ceil(log10(fs/df)/log10(2)) );
/* overlap factor in order to reduce the number of short time fft's.
   L is the offset between points in the same column at consecutive
   rows. L should be less than or equal to Np/4
   (Prof. Loomis paper) */
double L = Np/4;
/* determine number of columns formed in the
   channelization matrix (x) */
double P = pow(2.0, ceil(log10(fs/dalpha/L)/log10(2)) );
/* determine total number of points in the input data to
   be processed */
double N = P*L;
/* declare other variables and arrays to be used 
Note: I tried to declare them in the order in which they 
are needed. Some were consolidated. */ 
/* Loop Indexes */ 
int i= 0, j = 0, k=0, index = 0; 
/* Array to contain values from input file */ 
float *I_Values; 


























I_Values = (float*)malloc(N * sizeof(float)); 
/* Initial Array and Matrix */ 

double NN = (P-1)*L+Np; // resizes x array 

double *x; 

x = (double*)malloc(NN * sizeof (double) ); 

double *HAM; 

HAM = (double*)malloc(Np * sizeof (double)); 
int xInitialMax; // book-keeping on x array
/* Declare Variables used for MAP allocation */
int nmap = 1, mapnum = 0;
double (*Ifinal)[(int)(Np*Np)];
double (*Qfinal)[(int)(Np*Np)];
Ifinal = Cache_Aligned_Allocate(P * Np * Np * sizeof(double));
Qfinal = Cache_Aligned_Allocate(P * Np * Np * sizeof(double));
float IXF2[(int)P][(int)(Np*Np)], QXF2[(int)P][(int)(Np*Np)];
/* Declare Output Variables */
double (*MM)[(int)(Np*Np)], (*Sx)[2*(int)N+1];
double Iscr, Qscr;
double alpha, f, kk, ll, Sx_max;
int64_t joverNp, rem;
int a;
Sx = Cache_Aligned_Allocate((Np+1) * ((2 * N)+1)
     * sizeof(double));
MM = Cache_Aligned_Allocate(((int)((3*P/4)-(P/4))+1) * Np *
     Np * sizeof(double));
/* GET SECOND TIME HAC; STOP TIMING TO BRING DATA IN */
gettimeofday(&temp_stop, NULL);
printf("\n");
/* OPEN THE INPUT FILES */
I_ptr = fopen(I_file, "r");
if (I_ptr==NULL)
{
    printf("Error opening I-channel input file.\n");
    return(1);
}
/* READ IN THE I-CHANNEL FILE */
while ( (fscanf(I_ptr, "%f", &I_Values[i]) != EOF) && (i<N) )
{
    x[i] = I_Values[i];
    i++;
}
/* This Loop fills the x array with zeros if there wasn't N
   rows of data in the input file */
if (i < N)
    while (i < N)
    {
        x[i] = 0;
        i++;
    }
xInitialMax = i;
fclose(I_ptr);
/* GET THIRD TIME HAC; RESTART TIMING */
gettimeofday(&post_dataread, NULL);
timei = timeval_subtract(&time1, &post_dataread, &temp_stop);
datainput = time1.tv_sec + time1.tv_usec*1.0e-6;


/* %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   INPUT CHANNELIZATION - this part limits the total number of points
   to be analyzed. It also generates a Np-by-P matrix, X, with shifted
   versions of the input vector in each column.
   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% */
/* Zero fill x if we don't have NN samples. The loop does the
   xx(NN) = 0 loop in the Matlab code. */
for (i=xInitialMax; i<NN; i++) x[i] = 0;








/* Reserve MAP */
if (map_allocate(nmap))
{
    fprintf(stdout, "Map allocation failed for channelization.\n");
    exit(1);
}
/* SET up Hamming window for MAP */
for (i=0; i<Np; i++)
{
    HAM[i] = 0.54 - 0.46*cosf(2*pi*i/(Np-1));
    printf("\t %1.5f", HAM[i]);
}
/* Take time hac */
gettimeofday(&subr_go, NULL);

/******** Call subroutine and Restart Timing ********/
PrimaryChip(x, HAM, Ifinal, Qfinal, &tchannel, &tfft1, &tfft2,
            &tdownconv, &tDMA, mapnum);
/******** Return from FPGAs ********/

gettimeofday(&subr_return, NULL);
channelize = tchannel*1e-8;
downconvert = tdownconv*1e-8;
FFT1 = tfft1*1e-8;
FFT2 = tfft2*1e-8;
FFT_time += FFT1+FFT2;
DMAtime = tDMA*1e-8;
timei = timeval_subtract(&time1, &subr_return, &subr_go);
timeonFPGA += time1.tv_sec + time1.tv_usec*1.0e-6; // time spent on FPGAs














//partial printout for verification
printf("\n********* ON CPU Readout of FFT 2 *****\n");
for (a=0; a<5; a++)
{
    printf("\t%2.8f+i*%2.8f\n", Ifinal[a][0], Qfinal[a][0]);
}
/* Free map */
if (map_free(nmap))
{
    printf("Map deallocation failed for downconversion.\n");
    exit(1);
}
//Post FPGA Data Manipulation
/* ... the matlab code in one single loop */
/* Swap bottom and top halves: */
for(i=0; i<=(P/2-1); i++)
{
    for(j=0; j<Np*Np; j++)
    {
        /* Bottom real half becomes top real half */
        IXF2[i][j] = Ifinal[i+(int)(P/2)][j];
        /* Top real half becomes bottom real half */
        IXF2[i + (int)(P/2)][j] = Ifinal[i][j];
        /* Bottom imaginary half becomes top imaginary half */
        QXF2[i][j] = Qfinal[i+(int)(P/2)][j];
        /* Top imaginary half becomes bottom imaginary half */
        QXF2[i + (int)(P/2)][j] = Qfinal[i][j];
    } // for j
} // for i

/* Obtain the magnitude of the complex values */
for (i=(P/4)-1; i<(3*P/4); i++)
{
    for(j=0; j<Np*Np; j++)
    {
        Iscr = IXF2[i][j];
        Qscr = QXF2[i][j];
        MM[i-(int)(P/4)+1][j] = sqrtf( (Iscr*Iscr) + (Qscr*Qscr) );
    } // for j
} // for i
/* %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   DATA DISPLAY - display only the data inside the range of interest -
   centralizes the bi-frequency plane according to alpha0 and f0
   vectors. Note: the alpha0 and f0 vectors are defined as follows
   (in matlab terms):
       alpha0 = -fs:fs/N:fs;
       f0 = -fs/2:fs/Np:fs/2;
   but are not declared in this program since they are only used
   for plotting the results.
   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% */

Sx_max = 0;

/* Clear Sx matrix since not every location is necessarily written to.
   Seems like this loop is unnecessary, but I had instances where old
   data in the memory was being used. */
for (i = 0; i<Np+1; i++)
{
    for (j=0; j<2*N+1; j++)
    {
        Sx[i][j] = 0;
    } // for j
} // for i

/* Determine Final Output */
for (i=0; i<=.5*P; i++)
{
    for(j=0; j<Np*Np; j++)
    {
        joverNp = (int)((j+1)/Np);
        rem = (j+1) - Np*joverNp;

        if (rem == 0)
        {
            c = .5*Np - 1;
        }
        else
        {
            c = rem - .5*Np - 1;
        }
        k = joverNp - .5*Np;
        p = i - .25*P;

        alpha = ((k-c)/Np) + ((p-1)/N);
        f = .5*(k+c)/Np;

        if (((alpha > -1) & (alpha < 1)) | ((f > -.5) & (f < .5)))
        {
            kk = 1+Np*(f + .5);
            if ( (kk-(int)kk) < .5) kk = (int)kk; else kk = (int)kk + 1;

            ll = 1+N*(alpha + 1);
            if ( (ll-(int)ll) < .5) ll = (int)ll; else ll = (int)ll + 1;

            Sx[(int)kk-1][(int)ll-1] = MM[i][j];

            /* find max value of Sx so it can be normalized later */
            if (MM[i][j] > Sx_max) {Sx_max = MM[i][j];}
        } // end if
    } // for j
} // for i
/* Normalize Sx - ORIGINAL */
for(i=0; i<Np+1; i++)
{
    for(j=0; j<2*N+1; j++)
    {
        Sx[i][j] = Sx[i][j]/Sx_max;
    }
}

/* get fourth time hac (stop timing) and display: */
gettimeofday(&endofprogram, NULL);
timei = timeval_subtract(&time1, &endofprogram, &subr_return);
postFPGA += time1.tv_sec + time1.tv_usec*1.0e-6;
timei = timeval_subtract(&time1, &endofprogram, &startprogram);
overall_time = time1.tv_sec + time1.tv_usec*1.0e-6;

//display timing to terminal
printf("Execution times:\n");
printf(" %3.6f seconds total\n", overall_time);
printf(" %3.6f seconds on FPAG\n", timeonFPGA);
printf("\nOf the total time:\n");
printf(" %3.6f seconds were spent performing FFTs\n",
       FFT_time);
printf(" Of the time spent on the MAP for FFTs:\n");
printf(" %3.6f seconds were spent on FFT1\n", FFT1);
printf(" %3.6f seconds were spent on FFT2\n", FFT2);
printf(" %3.6f seconds were spent on Channelize\n",
       channelize);
printf(" %3.6f seconds were spent on Downconvert\n",
       downconvert);
printf(" %3.6f seconds were spent postFPGA data processing\n",
       postFPGA);
printf(" %3.6f seconds were spent on DMAs\n", DMAtime);
printf("\nExecution time not including data input and DMA time: %3.6f seconds\n",
       overall_time - datainput - DMAtime);

/* PRINT OUTPUT FILE */
Output = fopen(Output_file, "w");
if (Output == NULL)
{
    puts("Error creating output file.");
    return(1);
}
for(i=0; i<Np + 1; i++) // Np + 1 for Sx, 5 for MM
{
    for (j=0; j<2*N + 1; j++) // 2*N + 1 for Sx, Np*Np for MM
    {
        fprintf(Output, "%6.32f\t", Sx[i][j]);
    }
    fprintf(Output, "\n");
}
fclose(Output);
printf("Results from Cyclostationary FAM algorithm written to: %s\n",
       Output_file);

printf("\nEnd of FAM Program Execution\n");
return 0;
}

APPENDIX D. DCP.MC


//
// FILE : dcp.c
// description: File resides on the MAP's primary Chip
// Modified by: LT Wesley Simon
// For : thesis
// Date : September 2009
//

#include <libmap.h>
#include "FAM_const.c"
#define n 64

void PrimaryChip(double xx[], double ham[], double Ifinal[], double
Qfinal[], int64_t *t_channelize, int64_t *t_fft1, int64_t *t_fft2,
int64_t *t_downconvert, int64_t *t_DMA, int mapno) {


/* Memory Management Plan */
OBM_BANK_C_2D (IXE, double, P, Np)
OBM_BANK_D_2D (QXE, double, P, Np)
OBM_BANK_A_2D (finalI, double, P, Np*Np)
OBM_BANK_B_2D (finalQ, double, P, Np*Np)
OBM_BANK_E (al, double, NN) //input data
OBM_BANK_F (HAM, double, Np) //hamming window data








/* Declare Other Variables */
int64_t i, j, k, r, a, nbytes, index, k2, s, log2n;
float hamming;
int L = Np/4;

/* Timing Variables*/
int64_t Tstart, TafterC, Tfft1start, Tfft1done, fft2, dc;
int64_t Tdma, dmastart, dmastop, dmain, dmaout;

/*Declare channelize variables */
float X[Np][P];  //input data stored here
float XW[Np][P]; //post hamming window

//FFT1 Variables
float I[n], Q[n];
float a1_copy[(int)Np], a2_copy[(int)Np];
float aI[(int)Np], aQ[(int)Np];
float temp_I_out[(int)Np], temp_Q_out[(int)Np];
float FFTI_out[(int)Np][(int)P], FFTQ_out[(int)Np][(int)P];
int m, o, z;
unsigned int ii, p, q;
float wm1, wm2, w1, w2, t0, t1, t2, u1, u2, u1a, u2a;

//downconvert variables
float Ii, Qi, Ij, Qj; //temporary I and Q scratch variables
float IXF1[(int)Np][(int)P], QXF1[(int)Np][(int)P];
float twdlI[(int)Np][(int)P], twdlQ[(int)Np][(int)P];





/* Start MAP timing */
start_timer();
read_timer(&Tstart);

/* Transfer Data to MAP from CM */
nbytes = NN * sizeof(double);
read_timer(&dmastart);
DMA_CPU(CM2OBM, al, MAP_OBM_stripe(1, "E"), xx, 1, nbytes, 0);
wait_DMA(0);
DMA_CPU(CM2OBM, HAM, MAP_OBM_stripe(1, "F"), ham, 1,
        Np*sizeof(double), 0);
wait_DMA(0);
read_timer(&dmastop);
dmain = dmastop-dmastart; //get dma time








//build twiddle table
for (r=0; r<Np; r++)
{
    k = r - ((int)Np/2);
    for (j=0; j<P; j++)
    {
        twdlI[r][j] = cosf(-pi2*k*j*L/Np);
        twdlQ[r][j] = sinf(-pi2*k*j*L/Np);
    }
}








/* Channelize: Turn Array into Matrix */
for (i=0; i<P; i++)
{
    index = 0;
    for (j=i*L+1; j<=i*L+Np; j++)
    {
        X[index][i] = al[j-1];
        index++;
    }
}

/* The following loop was used to generate data for G. Upperman's
   Thesis. Note: printf statement only works in debug mode. */
printf("*********** THESIS DATA: CHANNELIZATION *******\n");
for (i=0; i<5; i++)
{
    printf("\t");
    for (j=0; j<2; j++)
    {
        printf("%2.8f\t", X[i][j]);
    }
    printf("\n");
}


/* Apply Hamming window */
for (j=0; j<P; j++)

} // for

/* Get Time Hac */
read_timer(&TafterC); //get timing post channelize
*t_channelize = TafterC - Tstart;





[ [RRR KKK KKK KK KKK KKK KKK KKK KKK ABP TRST FRET RXKKKKKKKKKKKKKKKKKKKKKKKKKKK KKK 
/* Get Time Hac and calculate timing */ 

read_timer(&Tfftlstart); 
/* Determine Log Base 2 */ 

for (i=8*sizeof(int)-1; i>=0 && ((1<<i) & n)==0; i--); 

log2n = i; 


//build input matrix 
for (a=0; a< P; att) 
{ 





/* Build FFT Input Matrix */ 
for (j=0;  j<Np; J++) 


I[j] = XW[jllal; 
Q[j] = 0.0f; 
be fe tor 7 


/* reorder input and split input into real and complex parts */ 
for (i=0; i<n; itt) 
{ 
/* reverse bits 0 thru k-1 in the integer "a" */ 
for (ii=0=0, p = 1, q = 1<<(log2n-1); 
Li<log2n; 
lit+, p <<= 1, gq >>= 1) if (1 & q) O=o | p; 





/* loop on FFT stages */
for (s=1; s<=log2n; s++)
{
    m = 1<<s; /* m = 2^s */

    t0 = pi2/m;

    wm1 = cosf(t0); /* wm = exp(q*2*pi*i/m); */
    wm2 = sinf(t0);

    w1 = 1.0f;

    w2 = 0.0f;

    for (j=0; j<m/2; j++)
    {
        for (k=j; k<n; k+=m)
        {
            /* t = w * a[k+m/2]; */
            k2 = k + m/2;
            u1 = aI[k2];
            u2 = aQ[k2];
            u1a = aI[k];
            u2a = aQ[k];
            t1 = w1 * u1 - w2 * u2;
            t2 = w1 * u2 + w2 * u1;
            aI[k] = u1a + t1;
            aQ[k] = u2a + t2;
            aI[k2] = u1a - t1;
            aQ[k2] = u2a - t2;
        } // for k
        /* w = w * wm; */
        t1 = w1 * wm1 - w2 * wm2;
        w2 = w1 * wm2 + w2 * wm1;
        w1 = t1;
    } // for j
} // for s


/* flip the final stage */
temp_I_out[0] = aI[0];
temp_I_out[(int)n/2] = aI[(int)n/2];
temp_Q_out[0] = aQ[0];
temp_Q_out[(int)n/2] = aQ[(int)n/2];





#pragma src parallel sections
{
#pragma src section
    {
        for (r=1; r<n/2; r++) {temp_I_out[r] = aI[n-r];}
        for (r=1; r<n/2; r++) {temp_I_out[n-r] = aI[r];}
    }
#pragma src section
    {
        for (j=1; j<n/2; j++) {temp_Q_out[j] = aQ[n-j];}
        for (j=1; j<n/2; j++) {temp_Q_out[n-j] = aQ[j];}
    }
}











/* Take time hac */
read_timer(&Tfft1done);
for (j=0; j<Np; j++)
{
    FFTI_out[j][a] = temp_I_out[j];
    FFTQ_out[j][a] = temp_Q_out[j];
}
} // for i


*t_fft1 = Tfft1done-Tfft1start;





/* Implement FFT shift: End Result swaps the top and bottom halves: */
for(i=0; i<(Np/2); i++)
{
    for(j=0; j<P; j++)
    {
        /* Bottom real half becomes top real half */
        IXF1[i][j] = FFTI_out[i+(int)(Np/2)][j];

        /* Top real half becomes bottom real half */
        IXF1[i+(int)(Np/2)][j] = FFTI_out[i][j];

        /* Bottom imag half becomes top imag half */
        QXF1[i][j] = FFTQ_out[i+(int)(Np/2)][j];

        /* Top imag half becomes bottom imag half */
        QXF1[i+(int)(Np/2)][j] = FFTQ_out[i][j];

    } // for j
} // for i


/* The following nested loop was used to generate data for G.
   Upperman's Thesis. Note: printf only works in debug mode */
printf ("********* THESIS DATA: FFT 1 AND SHIFT *****\n");
for (i=0; i<5; i++)
{
    printf ("\t%2.8f+i*%2.8f\n", IXF1[i][0], QXF1[i][0]);
}


/* Downconversion - the short sliding FFT's results are shifted
   to baseband to obtain decimated complex demodulate sequences.
   The transpose of the matrix is taken at the same time. */

for(i=0; i<Np; i++)
{
    for(j=0; j<P; j++)
    {
        Ii = twdlI[i][j];

        Qi = twdlQ[i][j];

        IXE[j][i] = (IXF1[i][j] * Ii) - (QXF1[i][j] * Qi);

        //IXE_copy[j][i] = IXE[j][i]; //identical copy for multiplication step

        QXE[j][i] = (IXF1[i][j] * Qi) + (QXF1[i][j] * Ii);

        //QXE_copy[j][i] = QXE[j][i]; //identical copy for multiplication step
    }
}


//Done with first FFT and Channelize give to Downconvert and FFT2 
//Synchronize with other chip to make sure its ready 
send_to_bridge_a (); 
// give banks B and C to the other chip 
send_perms (OBM_A|OBM_B|OBM_C|OBM_D); 
//second sync point 
send_to_bridge_a (); 
//final sync and receive timing information
recv_from_bridge_a (&fft2,&dc);
//DONE WORKING ON OTHER CHIP
*t_fft2=fft2;
*t_downconvert=dc;


//Remove memory permissions on secondary chip 
send_perms (0); 

//timing marker 

read_timer (&dmastart) ; 


// DMA the results back to CPU 

DMA_CPU (OBM2CM, finalI, MAP_OBM_stripe(1,"A"), Ifinal, 1,
P*Np*Np*sizeof (double), 0); 

wait_DMA (0); 


DMA_CPU (OBM2CM, finalQ, MAP_OBM_stripe(1,"B"), Qfinal, 1, 
P*Np*Np*sizeof (double), 0); 
wait_DMA (0); 
read_timer (&dmastop) ; 
dmaout= dmastop-dmastart; 
Tdma=dmain+dmaout;
*t_DMA=Tdma; 
} 
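
The first FFT in the listing above uses the twiddle factor exp(+j*2*pi/m) in each stage and then reverses the bin order ("flip the final stage"), which yields the conventional negative-exponent DFT. The following host-side sketch in plain C illustrates that idea for an 8-point transform. It is only a sketch and not the MAP code: the names fft8_flip, COS_TAB and SIN_TAB are hypothetical, and a constant twiddle table is assumed in place of cosf/sinf so that the example has no math-library dependency.

```c
#include <assert.h>
#include <stddef.h>

#define FFT_N 8        /* transform size for this sketch (hypothetical name) */
#define FFT_LOG2N 3    /* log2(FFT_N) */

/* cos and sin of 2*pi/2^s for s = 1..3; index 0 is unused */
static const float COS_TAB[4] = { 1.0f, -1.0f, 0.0f, 0.70710678f };
static const float SIN_TAB[4] = { 0.0f,  0.0f, 1.0f, 0.70710678f };

/* Radix-2 decimation-in-time FFT: bit-reversed load, positive-exponent
   butterflies, then an index flip so that out[r] is the forward DFT bin r. */
static void fft8_flip(const float *inI, const float *inQ,
                      float *outI, float *outQ)
{
    float aI[FFT_N], aQ[FFT_N];
    int i, s, j, k;

    assert(inI != NULL && inQ != NULL && outI != NULL && outQ != NULL);

    /* load the input in bit-reversed order */
    for (i = 0; i < FFT_N; i++) {
        unsigned int ii, o = 0, p = 1, q = 1u << (FFT_LOG2N - 1);
        for (ii = 0; ii < FFT_LOG2N; ii++, p <<= 1, q >>= 1)
            if ((unsigned int)i & q) o |= p;
        aI[o] = inI[i];
        aQ[o] = inQ[i];
    }

    /* butterfly stages, m = 2^s points per butterfly group */
    for (s = 1; s <= FFT_LOG2N; s++) {
        int m = 1 << s;
        float wm1 = COS_TAB[s], wm2 = SIN_TAB[s];
        float w1 = 1.0f, w2 = 0.0f;
        for (j = 0; j < m / 2; j++) {
            float t;
            for (k = j; k < FFT_N; k += m) {
                int k2 = k + m / 2;
                float t1 = w1 * aI[k2] - w2 * aQ[k2];  /* t = w * a[k2] */
                float t2 = w1 * aQ[k2] + w2 * aI[k2];
                float u1 = aI[k], u2 = aQ[k];
                aI[k]  = u1 + t1;  aQ[k]  = u2 + t2;
                aI[k2] = u1 - t1;  aQ[k2] = u2 - t2;
            }
            t  = w1 * wm1 - w2 * wm2;                  /* w = w * wm */
            w2 = w1 * wm2 + w2 * wm1;
            w1 = t;
        }
    }

    /* bin n-r of the positive-exponent result is bin r of the forward DFT */
    outI[0] = aI[0];
    outQ[0] = aQ[0];
    for (k = 1; k < FFT_N; k++) {
        outI[k] = aI[FFT_N - k];
        outQ[k] = aQ[FFT_N - k];
    }
}
```

For a unit impulse at sample 1, the forward DFT is exp(-j*2*pi*r/8), so bin 2 should come out close to -j; this gives a quick check that the flip restores the expected sign convention.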
















APPENDIX E. DCS.M 

















// 

// FILE dcs.c

// description: File resides on the MAP's Secondary Chip 
// 

// Modified by: LT Wesley Simon 

// For thesis 

// Date September 2009

// 

#include <libmap.h> 

#include "FAM_const.c" 


#define n 8 //size of FFT 2


void SecondaryChip ()
{
/* Declare Arrays in OBM */
OBM_BANK_C_2D (IXE, double, P, Np)
OBM_BANK_D_2D (QXE, double, P, Np)
OBM_BANK_A_2D (finalI, double, P, Np*Np)
OBM_BANK_B_2D (finalQ, double, P, Np*Np)
OBM_BANK_E (al, double, NN) //input data

/* Declare Other Variables */
// timing variables
int64_t scstart,tdcstop,tfft2stop,fft2,dc;
int64_t nbytes, i, j, k, k2, s, log2n;
unsigned int ii, p, q;
int ma,mb,mc, o,r,a, row, col,d, index;
float wmA1, wmA2, wA1, wA2, tA0, tA1, tA2, uA1, uA2, uA1a, uA2a, na1, na2;
float wmB1, wmB2, wB1, wB2, tB0, tB1, tB2, uB1, uB2,uB1a,uB2a;
float wmC1, wmC2, wC1, wC2, tC0, tC1, tC2, uC1, uC2,uC1a,uC2a;
float a1_copy[(int)Np], a2_copy[(int)Np];
float Ia_copy[(int)Np], Qa_copy[(int)Np];
float Ib_copy[(int)Np], Qb_copy[(int)Np];
float Ia[(int)Np], Qa[(int)Np];
float Ib[(int)Np], Qb[(int)Np];
float Ibeven[(int)Np], Qbeven[(int)Np];
float Ibodd[(int)Np], Qbodd[(int)Np];
float Iceven[(int)Np], Qceven[(int)Np];
float Icodd[(int)Np], Qcodd[(int)Np];
float Ic[(int)Np], Qc[(int)Np];
float Iaeven[(int)Np], Iaodd[(int)Np];
float Qaeven[(int)Np], Qaodd[(int)Np];

/*SINE and COSINE array variables */
float SIN_Array[4];//set to log2n +1
float COS_Array[4];

/* Declare downconvert Variables */
int L = Np/4;
float Ii, Qi, Ij, Qj; //temporary I and Q scratch variables
float IXF1[(int)Np][(int)P], QXF1[(int)Np][(int)P];
float IXE_bram[(int)P][(int)Np], QXE_bram[(int)P][(int)Np];
float IXE_copy[(int)P][(int)Np], QXE_copy[(int)P][(int)Np];
float IXM[(int)P][(int)Np*Np], QXM[(int)P][(int)Np*Np];
float I[n],Q[n];
float a1[(int)Np], a2[(int)Np];
float temp_I_out[(int)n],temp_Q_out[(int)n];
recv_from_bridge_a (); 
recv_perms (); 

//initialize final answer arrays to all zeros

for (row=0; row<P; row++) {
    for (col=0; col<Np*Np; col++) {
        finalI[row][col]=0;
        finalQ[row][col]=0;
    }
}

recv_from_bridge_a (); 
//build sin_cos array - used to save sine and cosine resources


for (index=1; index<4; index++) {
    SIN_Array[index] = sinf(pi2/(1<<index));
    COS_Array[index] = cosf(pi2/(1<<index));
}


//Start downconvert 
//*************** DOWNCONVERT - Continued ***************








//make second copy of IXE and QXE to remove read write dependencies in multiplication
for(i=0; i<Np; i++)
{
    for(j=0; j<P; j++)
    {
        IXE_bram[j][i] = IXE[j][i];
        IXE_copy[j][i] = IXE[j][i]; //identical copy for multiplication step
        QXE_bram[j][i] = QXE[j][i];
        QXE_copy[j][i] = QXE[j][i]; //identical copy for multiplication step
    }
}





/* Multiplication - the product sequences between each one of the
   complex demodulates and the complex conjugate of the others
   are formed. This forms the area in the bi-frequency plane. */
for (i=0; i<Np; i++)
{
    for (j=0; j<Np; j++)
    {
        for (k=0; k<P; k++)
        {
            Ii = IXE_bram[k][i];
            Qi = QXE_bram[k][i];
            Ij = IXE_copy[k][j];
            Qj = QXE_copy[k][j];
            IXM[k][i*(int)Np+j] = (Ii * Ij) + (Qi * Qj);
            QXM[k][i*(int)Np+j] = -(Ii * Qj) + (Qi * Ij);
        } // for k
    } // for j
} // for i


/* Get last time hac and report */
read_timer (&tdcstop);


/* Determine Log Base 2 */
for (i=8*sizeof(int)-1; i>=0 && ((1<<i) & n)==0; i--);

log2n = i;


//build input matrix
printf ("Iteration of FFT");

for (a=0; a<Np*Np; a++)
{
    printf ("\r%d",a);
    /* Build FFT Input matrix */
    for (j=0; j<P; j++)
    {
        I[j] = IXM[j][a];
        Q[j] = QXM[j][a];
    } // for j
    /* reorder input and split input into real and complex parts */
    for (i=0; i<n; i++)
    {
        /* reverse bits 0 thru k-1 in the integer "a" */
        for (ii=o=0, p = 1, q = 1<<(log2n-1);
             ii<log2n;
             ii++, p <<= 1, q >>= 1) if (i & q) o = o | p;
        a1[o] = I[i];
        a2[o] = Q[i];
        a1_copy[o] = I[i];//make copy of data to reduce multiple reads
        a2_copy[o] = Q[i];//make copy of data to reduce multiple reads
    }

    // printf("\tFirst stage");
    /* loop on FFT stages */
    //ma = 1<<1; /* m = 2^s */
    ma=2;
    wmA1 = COS_Array[1]; /* wm = exp(q*2*pi*i/m); */
    wmA2 = SIN_Array[1];
    wA1 = 1.0f;
    wA2 = 0.0f;

    for (j=0; j<ma/2; j++)
    {
        for (k=j; k<n; k+=ma)
        {
            /* t = w * a[k+m/2]; */
            k2 = k + ma/2;
            uA1a = a1_copy[k2];
            uA2a = a2_copy[k2];
            tA1 = wA1 * uA1a - wA2 * uA2a;
            tA2 = wA1 * uA2a + wA2 * uA1a;
            uA1 = a1[k];
            uA2 = a2[k];
            Iaeven[k] = uA1 + tA1;
            Qaeven[k] = uA2 + tA2;
            Iaodd[k2] = uA1 - tA1;
            Qaodd[k2] = uA2 - tA2;
        } // for k
        /* w = w * wm; */
        na1 = wA1 * wmA1 - wA2 * wmA2;
        na2 = wA1 * wmA2 + wA2 * wmA1;
        wA1 = na1;
        wA2 = na2;
    } // for j
    Ia[0]=Iaeven[0];
    Ia[1]=Iaodd[1];
    Ia[2]=Iaeven[2];
    Ia[3]=Iaodd[3];
    Ia[4]=Iaeven[4];
    Ia[5]=Iaodd[5];
    Ia[6]=Iaeven[6];
    Ia[7]=Iaodd[7];
    Qa[0]=Qaeven[0];
    Qa[1]=Qaodd[1];
    Qa[2]=Qaeven[2];
    Qa[3]=Qaodd[3];
    Qa[4]=Qaeven[4];
    Qa[5]=Qaodd[5];
    Qa[6]=Qaeven[6];
    Qa[7]=Qaodd[7];
    for (k=0; k<n; k++) {
        Ia_copy[k]=Ia[k];
        Qa_copy[k]=Qa[k];
    }





    //****************** Second stage ******************


    //mb = 1<<2; /* m = 2^s */

    mb=4;
    wmB1 = COS_Array[2]; /* wm = exp(q*2*pi*i/m); */
    wmB2 = SIN_Array[2];

    wB1 = 1.0f;
    wB2 = 0.0f;


    for (j=0; j<mb/2; j++)
    {
        for (k=j; k<n; k+=mb)
        {
            /* t = w * a[k+m/2]; */
            k2 = k + mb/2;

            //fetch data needed
            uB1a = Ia_copy[k2];
            uB2a = Qa_copy[k2];
            tB1 = wB1 * uB1a - wB2 * uB2a;
            tB2 = wB1 * uB2a + wB2 * uB1a;
            uB1 = Ia[k];
            uB2 = Qa[k];
            Ibeven[k] = uB1 + tB1;
            Qbeven[k] = uB2 + tB2;
            Ibodd[k2] = uB1 - tB1;
            Qbodd[k2] = uB2 - tB2;
        } // for k
        /* w = w * wm; */
        tB1 = wB1 * wmB1 - wB2 * wmB2;
        wB2 = wB1 * wmB2 + wB2 * wmB1;
        wB1 = tB1;
    } // for j
    Ib[0]=Ibeven[0];
    Ib[1]=Ibeven[1];
    Ib[2]=Ibodd[2];
    Ib[3]=Ibodd[3];
    Ib[4]=Ibeven[4];
    Ib[5]=Ibeven[5];
    Ib[6]=Ibodd[6];
    Ib[7]=Ibodd[7];
    Qb[0]=Qbeven[0];
    Qb[1]=Qbeven[1];
    Qb[2]=Qbodd[2];
    Qb[3]=Qbodd[3];
    Qb[4]=Qbeven[4];
    Qb[5]=Qbeven[5];
    Qb[6]=Qbodd[6];
    Qb[7]=Qbodd[7];
    for (k=0; k<n; k++) {
        Ib_copy[k]=Ib[k];
        Qb_copy[k]=Qb[k];
    }


    //****************** Third stage ******************

    // printf("...Third stage\n");
    //mc = 1<<3; /* m = 2^s */
    mc=8;
    wmC1 = COS_Array[3]; /* wm = exp(q*2*pi*i/m); */
    wmC2 = SIN_Array[3];
    wC1 = 1.0f;
    wC2 = 0.0f;
































    for (j=0; j<mc/2; j++)
    {
        for (k=j; k<n; k+=mc)
        {
            /* t = w * a[k+m/2]; */
            k2 = k + mc/2;
            uC1a = Ib_copy[k2];
            uC2a = Qb_copy[k2];
            tC1 = wC1 * uC1a - wC2 * uC2a;
            tC2 = wC1 * uC2a + wC2 * uC1a;
            uC1 = Ib[k];
            uC2 = Qb[k];
            Iceven[k] = uC1 + tC1;
            Qceven[k] = uC2 + tC2;
            Icodd[k2] = uC1 - tC1;
            Qcodd[k2] = uC2 - tC2;
        } // for k
        /* w = w * wm; */
        tC1 = wC1 * wmC1 - wC2 * wmC2;
        wC2 = wC1 * wmC2 + wC2 * wmC1;
        wC1 = tC1;
    } // for j
    Ic[0]=Iceven[0];
    Ic[1]=Iceven[1];
    Ic[2]=Iceven[2];
    Ic[3]=Iceven[3];
    Ic[4]=Icodd[4];
    Ic[5]=Icodd[5];
    Ic[6]=Icodd[6];
    Ic[7]=Icodd[7];
    Qc[0]=Qceven[0];
    Qc[1]=Qceven[1];
    Qc[2]=Qceven[2];
    Qc[3]=Qceven[3];
    Qc[4]=Qcodd[4];
    Qc[5]=Qcodd[5];
    Qc[6]=Qcodd[6];
    Qc[7]=Qcodd[7];
    //END STAGES
    /* flip the final stage */
    temp_I_out[0] = Ic[0];
    temp_I_out[(int)n/2] = Ic[(int)n/2];
    temp_Q_out[0] = Qc[0];
    temp_Q_out[(int)n/2] = Qc[(int)n/2];

#pragma src parallel sections
    {
#pragma src section
        {
            for (r=1; r<n/2; r++) {temp_I_out[r] = Ic[n-r];}
            for (r=1; r<n/2; r++) {temp_I_out[n-r] = Ic[r];}
        }
#pragma src section
        {
            for (r=1; r<n/2; r++) {temp_Q_out[r] = Qc[n-r];}
            for (r=1; r<n/2; r++) {temp_Q_out[n-r] = Qc[r];}
        }
    }

    for (r=0; r<P; r++)
    {
        finalI[r][a] = temp_I_out[r];
        finalQ[r][a] = temp_Q_out[r];
    }

}//for a
/* Get time hac */ 
read_timer (&tfft2stop) ; 
fft2=tfft2stop-tdcstop; 
dc=tdcstop-scstart; 
send_to_bridge_a (fft2,dc); 
recv_perms (); 
printf("\ndone with program returning to primary\n"); 


} 
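
The multiplication step in the listing above forms, for every pair of channels (i, j), the product of demodulate i with the complex conjugate of demodulate j, and stores it in column i*Np+j. The following plain C sketch illustrates that indexing and the sign of each term on the host. It is only an illustration under stated assumptions: the function name cross_products and the flat row-major array layout are hypothetical and do not appear in the MAP code.

```c
#include <assert.h>
#include <stddef.h>

/* For every row k and channel pair (i, j):
     out[k][i*np + j] = x[k][i] * conj(x[k][j])
   x is p rows by np columns; out is p rows by np*np columns (row-major). */
static void cross_products(size_t p, size_t np,
                           const float *xI, const float *xQ,
                           float *outI, float *outQ)
{
    size_t i, j, k;

    assert(xI != NULL && xQ != NULL && outI != NULL && outQ != NULL);

    for (i = 0; i < np; i++) {
        for (j = 0; j < np; j++) {
            for (k = 0; k < p; k++) {
                float Ii = xI[k * np + i], Qi = xQ[k * np + i];
                float Ij = xI[k * np + j], Qj = xQ[k * np + j];
                size_t col = i * np + j;

                /* (Ii + jQi)(Ij - jQj) = (IiIj + QiQj) + j(QiIj - IiQj) */
                outI[k * np * np + col] = Ii * Ij + Qi * Qj;
                outQ[k * np * np + col] = Qi * Ij - Ii * Qj;
            }
        }
    }
}
```

With a single row holding the two samples 1+2j and 3+4j, the (0, 1) entry is (1+2j)(3-4j) = 11+2j and the (1, 0) entry is its conjugate, while the diagonal entries are the real-valued squared magnitudes.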
APPENDIX F. MISC CODE


A. HIGH PRECISION TIMING 


This file defines timeval_subtract, which computes the difference between two struct timeval values and is used to keep track of timing.


int timeval_subtract (struct timeval *result, struct timeval *x, struct 
timeval *y); 


/* Subtract the `struct timeval' values X and Y,
storing the result in RESULT. 
Return 1 if the difference is negative, otherwise 0. */ 














int
timeval_subtract (struct timeval *result, struct timeval *x, struct timeval *y)
{
    /* Perform the carry for the later subtraction by updating y. */
    if (x->tv_usec < y->tv_usec) {
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
        y->tv_usec -= 1000000 * nsec;
        y->tv_sec += nsec;
    }
    if (x->tv_usec - y->tv_usec > 1000000) {
        int nsec = (x->tv_usec - y->tv_usec) / 1000000;
        y->tv_usec += 1000000 * nsec;
        y->tv_sec -= nsec;
    }

    /* Compute the time remaining to wait.
       tv_usec is certainly positive. */
    result->tv_sec = x->tv_sec - y->tv_sec;
    result->tv_usec = x->tv_usec - y->tv_usec;

    /* Return 1 if result is negative. */
    return x->tv_sec < y->tv_sec;
}
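
One detail of timeval_subtract that is easy to overlook is that the carry step modifies its third argument. The sketch below shows one way to use the routine for elapsed-time measurement without disturbing the caller's timestamps. The helper name elapsed_seconds is hypothetical and is not part of the thesis code; the subtraction routine is repeated here only so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Same arithmetic as the listing above; note that it updates *y. */
static int timeval_subtract(struct timeval *result,
                            struct timeval *x, struct timeval *y)
{
    if (x->tv_usec < y->tv_usec) {
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
        y->tv_usec -= 1000000 * nsec;
        y->tv_sec += nsec;
    }
    if (x->tv_usec - y->tv_usec > 1000000) {
        int nsec = (x->tv_usec - y->tv_usec) / 1000000;
        y->tv_usec += 1000000 * nsec;
        y->tv_sec -= nsec;
    }
    result->tv_sec = x->tv_sec - y->tv_sec;
    result->tv_usec = x->tv_usec - y->tv_usec;
    return x->tv_sec < y->tv_sec;
}

/* Hypothetical helper: returns stop - start in seconds and works on local
   copies, so neither timestamp passed by the caller is modified. */
static double elapsed_seconds(const struct timeval *start,
                              const struct timeval *stop)
{
    struct timeval x, y, diff;

    assert(start != NULL && stop != NULL);
    x = *stop;
    y = *start;
    (void)timeval_subtract(&diff, &x, &y);
    return (double)diff.tv_sec + (double)diff.tv_usec / 1e6;
}
```

In practice the two timestamps would come from gettimeofday calls placed before and after the section being timed; for a start of 3.9 s and a stop of 5.1 s the helper returns approximately 1.2 s.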





B. GETFAM_CONSTANT.C 


This file is utilized to create a file of constants that are common to both the .mc files. Compile and execute this file to generate the output file FAM_const.c.


#include <stdio.h> 
#include <string.h> 
#include <math.h> 

#include <stdlib.h> 





/*getFAM_const.c
original work by Gary Upperman
Edited by: LT Wesley Simon
Added pi2 variable and changed constants from doubles to float by
adding 'f' to pi and pi2
*/


FILE *FAM_const;





int main (void)
{
    /* DECLARE VARIABLES AND CONSTANTS */
    /* declare file names and path */
    char const_file[] = "FAM_const.c";
    /* declare sample frequency, frequency resolution, and M */
    int fs, df, M;
    /* declare FAM-specific variables */
    double dalpha, Np, L, P, N, NN;

    /* Get data from user */
    printf("What is the sampling frequency (fs) (Hz)? ");
    scanf("%d", &fs);
    printf("What is the frequency resolution desired (df) (Hz)? ");
    scanf("%d", &df);
    printf("What is M? ");
    scanf("%d", &M);

    /* Calculate Values (cast to double to avoid integer division) */
    dalpha = (double)df/M;
    Np = pow(2.0, ceil(log10((double)fs/df)/log10(2)));
    L = Np/4;
    P = pow(2.0, ceil(log10(fs/dalpha/L)/log10(2)));
    N = P*L;
    NN = (P-1)*L+Np;

    /* PRINT CONSTANT FILE */
    FAM_const=fopen(const_file, "w");
    if (FAM_const == NULL)
    {
        printf("Error creating FAM constant file.");
        return(1);
    }

    fprintf(FAM_const, "#define N %i\n", (int)N);
    fprintf(FAM_const, "#define NN %i\n", (int)NN);
    fprintf(FAM_const, "#define Np %i\n", (int)Np);
    fprintf(FAM_const, "#define P %i\n", (int)P);
    fprintf(FAM_const, "#define fs %i\n", (int)fs);
    fprintf(FAM_const, "#define df %i\n", (int)df);
    fprintf(FAM_const, "#define M %i\n", (int)M);
    fprintf(FAM_const, "\n#define pi 3.14159265358979f\n");
    fprintf(FAM_const, "\n#define pi2 6.28318530717959f\n");
    fclose(FAM_const);

    /* END FUNCTION */
    printf("\nN: %i, NN: %i, Np: %i, P: %i\n",
           (int)N, (int)NN, (int)Np, (int)P);
    printf("Constant File written for Cyclostationary FAM Analysis.\n\n");
    return 0;
}
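
As a worked illustration of the parameter relationships computed above, the following sketch evaluates the same quantities for one set of inputs without writing a file. It is a minimal sketch under stated assumptions: the names fam_params, fam_constants and next_pow2 are hypothetical, the power-of-two rounding uses an integer loop in place of the pow/ceil/log10 expression (the two agree when the ratio is at least 1), and the inputs fs = 7000 Hz, df = 128 Hz and M = 2 are illustrative values chosen here rather than settings taken from the thesis.

```c
#include <assert.h>

/* Hypothetical container for the derived FAM constants. */
struct fam_params {
    int Np;         /* channelizer FFT length                    */
    int L;          /* decimation (hop) between channels, Np / 4 */
    int P;          /* number of channelizer columns             */
    int N;          /* P * L                                     */
    int NN;         /* input samples needed, (P - 1) * L + Np    */
    double dalpha;  /* cycle frequency resolution, df / M        */
};

/* Smallest power of two that is >= x (returns 1 for x <= 1). */
static int next_pow2(double x)
{
    int p = 1;
    while ((double)p < x)
        p <<= 1;
    return p;
}

static struct fam_params fam_constants(int fs, int df, int M)
{
    struct fam_params c;

    /* fs > 2*df keeps Np >= 4, so that L = Np/4 is at least 1 */
    assert(fs > 0 && df > 0 && M > 0 && fs > 2 * df);

    c.dalpha = (double)df / M;
    c.Np = next_pow2((double)fs / df);
    c.L  = c.Np / 4;
    c.P  = next_pow2((double)fs / c.dalpha / c.L);
    c.N  = c.P * c.L;
    c.NN = (c.P - 1) * c.L + c.Np;
    return c;
}
```

For the illustrative inputs, fs/df is about 54.7, so Np rounds up to 64 and L is 16; dalpha is 64 Hz, fs/dalpha/L is about 6.8, so P rounds up to 8, giving N = 128 and NN = 176.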